Improved Peptide Identification Sensitivity using Meta-Search, Grid-Computing and Machine-Learning with Application to Genome Annotation |
|
|---|---|
| Air date: | Friday, September 18, 2009, 10:00:00 AM (Eastern Time, Washington DC local time) |
| Category: | Proteomics |
| Description: | Mass-spectrometry-based proteomics experiments provide direct experimental evidence for the amino-acid sequence of functional proteins and their isoforms, evidence that is not available from other high-throughput experimental techniques. Large-scale proteomics datasets often contain strong evidence for novel, unexpected, or poorly annotated proteins and protein isoforms, but this evidence is typically missed by currently available tools for peptide identification from tandem mass spectra. We describe a variety of techniques designed to increase the sensitivity and scope of identified peptides so that this evidence is not lost, but is instead available for evaluation alongside other types of experimental and statistical evidence for genome annotation.
The PepArML meta-search engine demonstrates that multiple tandem mass spectrometry search engines, heterogeneous grid computing, and unsupervised machine-learning result reconciliation can substantially increase the number of high-confidence peptide identifications from tandem mass spectra datasets. The inclusive peptide sequence database PepSeqDB makes searching EST, mRNA, and other sources of putative peptide sequence fast and easy, and facilitates real-time projection of ad-hoc peptide sequences back to source evidence and annotation tracks in the UCSC genome browser. The PepArML MS/MS meta-search engine computes peptide identifications using the Mascot, X!Tandem, X!Tandem with KScore scoring plugin, OMSSA, and MyriMatch search engines, automatically reformatting spectral data and constructing search configurations for each search engine from a simple, unified search specification. Searches are automatically scheduled on a heterogeneous mix of local and remote compute nodes, including the Edwards Lab cluster at Georgetown and NSF TeraGrid compute resources at Purdue. Results from target and decoy searches are reconciled using the PepArML machine-learning-based result combiner. The grid-search infrastructure scales readily to hundreds of compute nodes, while the machine-learning-based result combiner can increase the number of peptide-spectrum assignments at fixed FDR two- to three-fold. We will demonstrate that these publicly available tools, applied to in-house and publicly available datasets, can provide significant evidence for novel, unexpected, and poorly annotated proteins and protein isoforms, and provide a cheap, effective way to improve the quality of genome annotations. http://proteome.nih.gov |
| Author: | Nathan Edwards, Ph.D., Georgetown University |
| Runtime: | 75 minutes |
| CIT File ID: | 15280 |
| CIT Live ID: | 8047 |
| Permanent link: | http://videocast.nih.gov/launch.asp?15280 |
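
The description compares methods by the number of peptide-spectrum assignments accepted at a fixed FDR, estimated from target and decoy searches. As a rough illustration of that metric only, the sketch below counts target assignments at a chosen FDR using the simple decoys/targets estimate. The function name `psms_at_fdr`, the `(score, is_decoy)` input layout, and the choice of estimator are assumptions made for this example; they are not taken from PepArML, whose combiner may compute FDR differently.

```python
from typing import List, Tuple


def psms_at_fdr(psms: List[Tuple[float, bool]], max_fdr: float) -> int:
    """Count target peptide-spectrum matches accepted at a given FDR.

    psms: hypothetical (score, is_decoy) pairs, higher score = better match.
    The FDR at each rank is estimated as decoys / targets, one common
    target-decoy estimator (an assumption, not PepArML's documented method).
    Tied scores are not treated specially in this sketch.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = 0
    decoys = 0
    accepted = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest cutoff whose estimated FDR is still acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            accepted = targets
    return accepted


if __name__ == "__main__":
    # Made-up scores for illustration only.
    example = [(10.0, False), (9.0, False), (8.0, True),
               (7.0, False), (6.0, False), (5.0, True)]
    print(psms_at_fdr(example, max_fdr=0.3))
```

Running the same function on the output of a single search engine and on a combined result list, at the same `max_fdr`, gives the kind of fixed-FDR comparison the description refers to.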