Similar articles
 20 similar articles found (search time: 140 ms)
1.
We present a new probability-based method for protein identification using tandem mass spectra and protein databases. The method employs a hypergeometric distribution to model frequencies of matches between fragment ions predicted for peptide sequences with a specific (M + H)+ value (at some mass tolerance) in a protein sequence database and an experimental tandem mass spectrum. The hypergeometric distribution constitutes the null hypothesis: all peptide matches to a tandem mass spectrum are random. It is used to generate a score characterizing the randomness of a database sequence match to an experimental tandem mass spectrum and to determine the level of significance of the null hypothesis. For each tandem mass spectrum and database search, a peptide is identified that has the least probability of being a random match to the spectrum, and the corresponding level of significance of the null hypothesis is determined. To check the validity of the hypergeometric model in describing fragment ion matches, we used the χ² test. The distribution of frequencies and corresponding hypergeometric probabilities are generated for each tandem mass spectrum. No proteolytic cleavage specificity is used to create the peptide sequences from the database. We do not use any empirical probabilities in this method. The scores generated by the hypergeometric model do not have a significant molecular weight bias and are reasonably independent of database size. The approach has been implemented in a database search algorithm, PEP_PROBE. By using a large set of tandem mass spectra derived from a set of peptides created by digestion of a collection of known proteins using four different proteases, a false positive rate of 5% is demonstrated.
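To make the idea of a hypergeometric null model concrete, here is a minimal sketch of how the chance probability of at least k fragment matches could be computed. The function names and the reading of the parameters are assumptions made for illustration (N: all predicted fragment ions across the candidate peptides; K: how many of those match a peak in the spectrum; n: fragment ions predicted for one peptide; k: matches observed for that peptide). The abstract does not give PEP_PROBE's actual parametrization, so this is not its implementation.

```python
from math import comb, log10


def hypergeom_tail(N: int, K: int, n: int, k: int) -> float:
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    Assumed (hypothetical) interpretation:
      N: total predicted fragment ions over all candidate peptides
      K: predicted fragment ions that match some experimental peak
      n: fragment ions predicted for the peptide being scored
      k: fragment ions of that peptide that matched
    """
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def randomness_score(N: int, K: int, n: int, k: int) -> float:
    """Illustrative score: -log10 of the chance probability (higher = less random)."""
    return -log10(hypergeom_tail(N, K, n, k))
```

Under this sketch, the peptide with the smallest tail probability (largest score) would be the one least likely to be a random match.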

2.
Database-searching algorithms compatible with shotgun proteomics match a peptide tandem mass spectrum to a predicted mass spectrum for an amino acid sequence within a database. SEQUEST is one of the most common software algorithms used for the analysis of peptide tandem mass spectra; it uses a cross-correlation (XCorr) scoring routine to match tandem mass spectra to model spectra derived from peptide sequences. To assess a match, SEQUEST uses the difference between the first- and second-ranked sequences (ΔCn). This value is dependent on the database size, search parameters, and sequence homologies. In this report, we demonstrate the use of a scoring routine (SEQUEST-NORM) that normalizes XCorr values to be independent of peptide size and the database used to perform the search. This new scoring routine is used to objectively calculate the percent confidence of protein identifications and posttranslational modifications based solely on the XCorr value.
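For readers unfamiliar with the difference score mentioned above, the following sketch computes it from a list of candidate XCorr values, assuming the commonly used normalized definition (top score minus runner-up, divided by the top score). The function name is hypothetical, and the sketch does not attempt to reproduce the SEQUEST-NORM normalization, whose details are not given in the abstract.

```python
def delta_cn(xcorrs: list[float]) -> float:
    """Normalized gap between the best and second-best XCorr values.

    Assumes the conventional definition (XCorr1 - XCorr2) / XCorr1.
    """
    if len(xcorrs) < 2:
        raise ValueError("need at least two candidate scores")
    ranked = sorted(xcorrs, reverse=True)
    if ranked[0] <= 0:
        raise ValueError("top XCorr must be positive")
    return (ranked[0] - ranked[1]) / ranked[0]
```

Because the runner-up score depends on which other sequences are in the database, the value shifts with database size and homology, which is the dependence the abstract points out.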

3.
The purpose of this work is to develop and verify statistical models for protein identification using peptide identifications derived from the results of tandem mass spectral database searches. Recently we have presented a probabilistic model for peptide identification that uses hypergeometric distribution to approximate fragment ion matches of database peptide sequences to experimental tandem mass spectra. Here we apply statistical models to the database search results to validate protein identifications. For this we formulate the protein identification problem in terms of two independent models, two-hypothesis binomial and multinomial models, which use the hypergeometric probabilities and cross-correlation scores, respectively. Each database search result is assumed to be a probabilistic event. The Bernoulli event has two outcomes: a protein is either identified or not. The probability of identifying a protein at each Bernoulli event is determined from relative length of the protein in the database (the null hypothesis) or the hypergeometric probability scores of the protein's peptides (the alternative hypothesis). We then calculate the binomial probability that the protein will be observed a certain number of times (number of database matches to its peptides) given the size of the data set (number of spectra) and the probability of protein identification at each Bernoulli event. The ratio of the probabilities from these two hypotheses (maximum likelihood ratio) is used as a test statistic to discriminate between true and false identifications. The significance and confidence levels of protein identifications are calculated from the model distributions. The multinomial model combines the database search results and generates an observed frequency distribution of cross-correlation scores (grouped into bins) between experimental spectra and identified amino acid sequences. The frequency distribution is used to generate p-value probabilities of each score bin. 
The probabilities are then normalized with respect to score bins to generate normalized probabilities of all score bins. A protein identification probability is the multinomial probability of observing the given set of peptide scores. To reduce the effect of random matches, we employ a marginalized multinomial model for small values of cross-correlation scores. We demonstrate that the combination of the two independent methods provides a useful tool for protein identification from results of database search using tandem mass spectra. A receiver operating characteristic curve demonstrates the sensitivity and accuracy level of the approach. The shortcomings of the models are related to the cases when protein assignment is based on unusual peptide fragmentation patterns that dominate over the model encoded in the peptide identification process. We have implemented the approach in a program called PROT_PROBE.
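The two-hypothesis binomial test described in this abstract can be sketched as follows. This is a simplified illustration, not the PROT_PROBE code: the function names are hypothetical, and the per-event probabilities `p_null` (for example, the protein's relative length in the database) and `p_alt` (derived from peptide scores) are taken as given inputs rather than computed.

```python
from math import comb, log


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def log_likelihood_ratio(k: int, n: int, p_null: float, p_alt: float) -> float:
    """log[ P(k | n, p_alt) / P(k | n, p_null) ].

    The binomial coefficient cancels, so the ratio is computed directly
    from the two per-event probabilities (both assumed to lie in (0, 1)).
    k: number of spectra matched to the protein's peptides
    n: number of spectra in the data set
    """
    return k * log(p_alt / p_null) + (n - k) * log((1 - p_alt) / (1 - p_null))
```

A large positive value favours the alternative hypothesis (a true identification); values near or below zero suggest the observed match count is explained by chance.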

4.
Detection and identification of pathogenic bacteria and their protein toxins play a crucial role in a proper response to natural or terrorist-caused outbreaks of infectious diseases. The recent availability of whole genome sequences of priority bacterial pathogens opens new diagnostic possibilities for identification of bacteria by retrieving their genomic or proteomic information. We describe a method for identification of bacteria based on tandem mass spectrometric (MS/MS) analysis of peptides derived from bacterial proteins. This method involves bacterial cell protein extraction, trypsin digestion, liquid chromatography MS/MS analysis of the resulting peptides, and a statistical scoring algorithm to rank MS/MS spectral matching results for bacterial identification. To facilitate spectral data searching, a proteome database was constructed by translating genomes of bacteria of interest with fully or partially determined sequences. In this work, a prototype database was constructed by the automated analysis of 87 publicly available, fully sequenced bacterial genomes with the GLIMMER gene finding software. MS/MS peptide spectral matching for peptide sequence assignment against this proteome database was done by SEQUEST. To gauge the relative significance of the SEQUEST-generated matching parameters for correct peptide assignment, discriminant function (DF) analysis of these parameters was applied and DF scores were used to calculate probabilities of correct MS/MS spectra assignment to peptide sequences in the database. The peptides with DF scores exceeding a threshold value determined by the probability of correct peptide assignment were accepted and matched to the bacterial proteomes represented in the database. Sequence filtering or removal of degenerate peptides matched with multiple bacteria was then performed to further improve identification. 
It is demonstrated that using a preset criterion with known distributions of discriminant function scores and probabilities of correct peptide sequence assignments, a test bacterium within the 87 database microorganisms can be unambiguously identified.

5.
A widespread proteomics procedure for characterizing a complex mixture of proteins combines tandem mass spectrometry and database search software to yield mass spectra with identified peptide sequences. The same peptides are often detected in multiple experiments, and once they have been identified, the respective spectra can be used for future identifications. We present a method for collecting previously identified tandem mass spectra into a reference library that is used to identify new spectra. Query spectra are compared to references in the library to find the ones that are most similar. A dot product metric is used to measure the degree of similarity. With our largest library, the search of a query set finds 91% of the spectrum identifications and 93.7% of the protein identifications that could be made with a SEQUEST database search. A second experiment demonstrates that queries acquired on an LCQ ion trap mass spectrometer can be identified with a library of references acquired on an LTQ ion trap mass spectrometer. The dot product similarity score provides good separation of correct and incorrect identifications.
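A dot product comparison between a query spectrum and a library reference might look like the sketch below. The binning width, the normalization to unit length (cosine similarity), and the function names are assumptions for illustration; the abstract does not say how peaks were binned or weighted.

```python
from math import sqrt

Peak = tuple[float, float]  # (m/z, intensity)


def bin_spectrum(peaks: list[Peak], bin_width: float = 1.0) -> dict[int, float]:
    """Sum peak intensities into fixed-width m/z bins."""
    binned: dict[int, float] = {}
    for mz, intensity in peaks:
        idx = int(mz // bin_width)
        binned[idx] = binned.get(idx, 0.0) + intensity
    return binned


def dot_product_score(query: list[Peak], reference: list[Peak],
                      bin_width: float = 1.0) -> float:
    """Normalized dot product (cosine) between two binned spectra, in [0, 1]."""
    q = bin_spectrum(query, bin_width)
    r = bin_spectrum(reference, bin_width)
    numerator = sum(v * r.get(i, 0.0) for i, v in q.items())
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(sum(v * v for v in r.values()))
    return numerator / norm if norm else 0.0
```

A library search would score the query against every reference with a compatible precursor mass and keep the highest-scoring one.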

6.
Algorithmic search engines bridge the gap between large tandem mass spectrometry data sets and the identification of proteins associated with biological samples. Improvements in these tools can greatly enhance biological discovery. We present a new scoring scheme for comparing tandem mass spectra with a protein sequence database. The MASPIC (Multinomial Algorithm for Spectral Profile-based Intensity Comparison) scorer converts an experimental tandem mass spectrum into an m/z profile of probability and then scores peak lists from potential candidate peptides using a multinomial distribution model. The MASPIC scoring scheme incorporates intensity, spectral peak density variations, and m/z error distribution associated with peak matches into a multinomial distribution. The scoring scheme was validated on two standard protein mixtures and an additional set of spectra collected on a complex ribosomal protein mixture from Rhodopseudomonas palustris. The results indicate a 5-15% improvement over Sequest for high-confidence identifications. The performance gap grows as sequence database size increases. Additional tests on spectra from proteinase-K digest data showed similar performance improvements, demonstrating the advantages of using MASPIC for studying proteins digested with less specific proteases. All these investigations show MASPIC to be a versatile and reliable system for peptide tandem mass spectral identification.

7.
Peptide identification based on tandem mass spectrometry and database searching algorithms has become one of the central technologies in proteomics. At the heart of this technology is the ability to reproducibly acquire high-quality tandem mass spectra for database interrogation. The variability in tandem mass spectra generation is often assumed to be minimal, and peptide identifications are typically based on a single tandem mass spectrum. In this paper, we characterize the variance of scores derived from replicate tandem mass spectra using several database search algorithms and demonstrate the effects of spectral variability on the correct identification of peptides. We show that the variance associated with the collection of tandem mass spectra can be substantial, leading to sizable errors in search algorithm scores (approximately 5-25% RSD) and ultimately incorrect assignments. Processing strategies are discussed to minimize the impact of tandem mass spectra variability on peptide identification.
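The relative standard deviation (RSD) quoted above is the sample standard deviation expressed as a percentage of the mean. A minimal sketch for a set of scores obtained from replicate spectra of the same peptide (the function name is hypothetical, and the input is assumed to be a plain list of search scores):

```python
from statistics import mean, stdev


def percent_rsd(scores: list[float]) -> float:
    """Relative standard deviation of replicate scores, in percent.

    Uses the sample standard deviation (n - 1 denominator), so at least
    two replicates are required.
    """
    return 100.0 * stdev(scores) / mean(scores)
```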

8.
MALDI-quadrupole time-of-flight mass spectrometry was applied to identify proteins from organisms whose genomes are still unknown. The identification was carried out by successively searching a sequence database: first with a peptide mass fingerprint, then with a packet of noninterpreted MS/MS spectra, and finally with peptide sequences obtained by automated interpretation of the MS/MS spectra. An "MS BLAST" homology searching protocol was developed to overcome specific limitations imposed by mass spectrometric data, such as the limited accuracy of de novo sequence predictions. This approach was tested in a small-scale proteomic project involving the identification of 15 bands of gel-separated proteins from the methylotrophic yeast Pichia pastoris, whose genome has not yet been sequenced and which is only distantly related to other fungi.

9.
A statistical model for identifying proteins by tandem mass spectrometry   (total citations: 51; self-citations: 0; citations by others: 51)
A statistical model is presented for computing probabilities that proteins are present in a sample on the basis of peptides assigned to tandem mass (MS/MS) spectra acquired from a proteolytic digest of the sample. Peptides that correspond to more than a single protein in the sequence database are apportioned among all corresponding proteins, and a minimal protein list sufficient to account for the observed peptide assignments is derived using the expectation-maximization algorithm. Using peptide assignments to spectra generated from a sample of 18 purified proteins, as well as complex H. influenzae and Halobacterium samples, the model is shown to produce probabilities that are accurate and have high power to discriminate correct from incorrect protein identifications. This method allows filtering of large-scale proteomics data sets with predictable sensitivity and false positive identification error rates. Fast, consistent, and transparent, it provides a standard for publishing large-scale protein identification data sets in the literature and for comparing the results obtained from different experiments.

10.
Current algorithms for quantifying peptide identification confidence in the accurate mass and time (AMT) tag approach assume that the AMT tags themselves have been correctly identified. However, there is uncertainty in the identification of AMT tags, because this is based on matching LC-MS/MS fragmentation spectra to peptide sequences. In this paper, we incorporate confidence measures for the AMT tag identifications into the calculation of probabilities for correct matches to an AMT tag database, resulting in a more accurate overall measure of identification confidence for the AMT tag approach. The method is referenced as Statistical Tools for AMT Tag Confidence (STAC). STAC additionally provides a uniqueness probability (UP) to help distinguish between multiple matches to an AMT tag and a method to calculate an overall false discovery rate (FDR). STAC is freely available for download, as both a command line and a Windows graphical application.

11.
This paper presents the application of sequential enhanced data processing procedures to high-resolution tandem mass spectra for identification of peptides using the Mascot database search algorithm. Strategies for (1) selection of fragment ion peaks from MS/MS spectra, (2) utilization of improved mass accuracy of the precursor ions, and (3) wavelet denoising of the mass spectra prior to fragment ion selection have been developed. The number of peptide identifications obtained using the enhanced processing was then compared with that obtained using software provided by the instrument manufacturer. Approximately 9000 MS/MS spectra acquired by the Applied Biosystems 4700 TOF/TOF MS instrument were used as a model data set. After application of the new processing, an increase of 33% in unique peptides and 22% in protein identifications with at least two unique peptides was found. The processing increased the percentage of false positive identifications, estimated by searching against a randomized database, from 2.7 to 3.9%, which was still below the 5% error rate specified in the Mascot search. These data processing approaches increase the amount of information that can be extracted from LC-MS analysis without the necessity of additional experiments.

12.
A method for rapid and unambiguous identification of proteins by sequence database searching using the accurate mass of a single peptide and specific sequence constraints is described. Peptide masses were measured using electrospray ionization-Fourier transform ion cyclotron resonance mass spectrometry to an accuracy of 1 ppm. The presence of a cysteine residue within a peptide sequence was used as a database searching constraint to reduce the number of potential database hits. Cysteine-containing peptides were detected within a mixture of peptides by incorporating chlorine into a general alkylating reagent specific for cysteine residues. Secondary search constraints included the specificity of the protease used for protein digestion and the molecular mass of the protein estimated by gel electrophoresis. The natural isotopic distribution of chlorine encoded the cysteine-containing peptide with a distinctive isotopic pattern that allowed automatic screening of mass spectra. The method is demonstrated for a peptide standard and unknown proteins from a yeast lysate using all 6118 possible yeast open reading frames as a database. As judged by calculation of codon bias, low-abundance proteins were identified from the yeast lysate using this new method but not by traditional methods such as tandem mass spectrometry via data-dependent acquisition or mass mapping.
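The combination of a tight mass tolerance with a sequence constraint can be sketched as a simple candidate filter. Everything named here is hypothetical: the function names, the `(sequence, theoretical_mass)` candidate format, and the placeholder masses in the usage note are illustrative only and are not real peptide masses or the authors' software.

```python
def within_ppm(measured: float, theoretical: float, tol_ppm: float = 1.0) -> bool:
    """True if the measured mass is within tol_ppm parts per million of the theoretical mass."""
    return abs(measured - theoretical) / theoretical * 1e6 <= tol_ppm


def filter_candidates(measured_mass: float,
                      candidates: list[tuple[str, float]],
                      tol_ppm: float = 1.0,
                      require_cys: bool = True) -> list[str]:
    """Keep candidate peptides that match the mass and, optionally, contain cysteine ('C')."""
    return [
        seq
        for seq, mass in candidates
        if within_ppm(measured_mass, mass, tol_ppm)
        and (not require_cys or "C" in seq)
    ]
```

For example, with placeholder candidates `[("ACDK", 1000.0000), ("AGDK", 1000.0005), ("MCLK", 1000.0100)]` and a measured mass of 1000.0003, only `"ACDK"` survives: the second candidate lacks cysteine and the third is about 10 ppm away.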

13.
Pan S, Gu S, Bradbury EM, Chen X. Analytical Chemistry, 2003, 75(6): 1316-1324
Identification of proteins with low sequence coverage using mass spectrometry (MS) requires tandem MS/MS peptide sequencing. It is very challenging to obtain a complete or to interpret an incomplete tandem MS/MS spectrum from fragmentation of a weak peptide ion signal for sequence assignment. Here, we have developed an effective and high-throughput MALDI-TOF-based method for the identification of membrane and other low-abundance proteins with a simple, one-dimensional separation step. In this approach, several stable isotope-labeled amino acid precursors were selected to mass-tag, in parallel, the proteome of human skin fibroblast cells in a residue-specific manner during in vivo cell culturing. These labeled residues can be recognized by their characteristic isotope patterns in MALDI-TOF MS spectra. The isotope pattern of particular peptides induced by the different labeled precursors provides information about their amino acid compositions. The specificity of peptide signals in a peptide mass mapping is thus greatly enhanced, resolving a high degree of mass degeneracy of proteolytic peptides derived from the complex human proteome. Further, false positive matches in database searching can be eliminated. More importantly, proteins can be accurately identified through a single peptide with its m/z value and partial amino acid composition. With the increased solubility of hydrophobic proteins in SDS, we have demonstrated that our approach is effective for the identification of membrane and low-abundance proteins with low sequence coverage and weak signal intensity, for which informative fragment patterns are often difficult to obtain in tandem MS/MS peptide sequencing analysis.

14.
There are several computer programs that can match peptide tandem mass spectrometry data to their exactly corresponding database sequences, and in most protein identification projects, these programs are utilized in the early stages of data interpretation. However, situations frequently arise where tandem mass spectral data cannot be correlated with any database sequences. In these cases, the unmatched data could be due to peptides derived from novel proteins, allelic or species-derived variants of known proteins, or posttranslational or chemical modifications. Two additional problems are frequently encountered in high-throughput protein identification. First, it is difficult to quickly sift through large amounts of data to identify those spectra that, due to poor signal or contaminants, can be ignored. Second, it is important to find incorrect database matches (false positives). We have chosen to address these difficulties by performing automatic de novo sequencing using a computer program called Lutefisk. Sequence candidates obtained are used as input in a homology-based database search program called CIDentify to identify variants of known proteins. Comparison of database-derived sequences with de novo sequences allows for electronic validation of database matches even if the latter are not completely correct. Modifications to the original Lutefisk program have been implemented to handle data obtained from triple quadrupole, ion trap, and quadrupole/time-of-flight hybrid (Qtof) mass spectrometers. For example, the linearity of mass errors due to temperature-dependent expansion of the flight tube in a Qtof was exploited such that isobaric amino acids (glutamine/lysine and oxidized methionine/phenylalanine) can be differentiated without careful attention to mass calibration.

15.
With high mass accuracy and consecutively obtained electron transfer dissociation (ETD) and higher-energy collisional dissociation (HCD) tandem mass spectrometry (MS/MS), reliable (≥97%) and sensitive fragment ions have been extracted for identification of specific amino acid residues in peptide sequences. The analytical benefit of these specific amino acid composition (AAC) ions is to restrict the database search space and provide identification of peptides with higher confidence and reduced false negative rates. The 6706 uniquely identified peptide sequences determined with a conservative Mascot score of >30 were used to characterize the AAC ions. The loss of amino acid side chains (small neutral losses, SNLs) from the charge-reduced peptide radical cations was studied using ETD. Complementary AAC information from HCD spectra was provided by immonium ions. From the ETD/HCD mass spectra, 5162 and 6720 reliable SNLs and immonium ions were successfully extracted, respectively. Automated application of the AAC information during database searching resulted in an average 3.5-fold higher confidence level of peptide identification. In addition, 4% and 28% more peptides were identified above the significance level in a standard and extended search space, respectively.

16.
A MALDI QqTOF mass spectrometer has been used to identify proteins separated by one-dimensional or two-dimensional gel electrophoresis at the femtomole level. The high mass resolution and the high mass accuracy of this instrument in both MS and MS/MS modes allow identification of a protein either by peptide mass fingerprinting of the protein digest or from tandem mass spectra acquired by collision-induced dissociation of individual peptide precursors. A peptide mass map of the digest and tandem mass spectra of multiple peptide precursor ions can be acquired from the same sample in the course of a single experiment. Database searching and acquisition of MS and MS/MS spectra can be combined in an interactive fashion, increasing the information value of the analytical data. The approach has demonstrated its usefulness in the comprehensive characterization of protein in-gel digests, in the dissection of complex protein mixtures, and in sequencing of a low molecular weight integral membrane protein. Proteins can be identified in all types of sequence databases, including an EST database. Thus, MALDI QqTOF mass spectrometry promises to have remarkable potential for advancing proteomic research.

17.
Proteolytic peptide mass mapping as measured by mass spectrometry provides a major approach for the identification of proteins. A protein is usually identified by the best match between the measured and calculated m/z values of the proteolytic peptides. A unique identification is, however, heavily dependent upon the mass accuracy and sequence coverage of the fragment ions generated by peptide ionization. Without ultrahigh instrumental accuracy, it is possible to increase the specificity of the assignments of particular proteolytic peptides by the incorporation of selected amino acid residue(s) enriched with stable isotope(s) into the protein sequence. Here we report this novel method of generating residue-specific mass-tagged proteolytic peptides for accurate and efficient protein identification. Selected amino acids are labeled with 13C/15N/2H and incorporated into proteins in a sequence-specific manner during cell culturing. Each of these labeled amino acids carries a defined mass change encoded in its monoisotopic distribution pattern. Through their characteristic patterns, the peptides with mass tags can then be readily distinguished from other peptides in mass spectra. This method of identifying unique proteins can also be extended to protein complexes and will significantly increase data search specificity, efficiency, and accuracy for protein identifications.

18.
Lu B, Ruse C, Xu T, Park SK, Yates J. Analytical Chemistry, 2007, 79(4): 1301-1310
We developed and compared two approaches for automated validation of phosphopeptide tandem mass spectra identified using database searching algorithms. Phosphopeptide identifications were obtained through SEQUEST searches of a protein database appended with its decoy (reversed sequences). Statistical evaluation and iterative searches were employed to create a high-quality data set of phosphopeptides. Automation of postsearch validation was approached by two different strategies. By using statistical multiple testing, we calculate a p value for each tentative peptide phosphorylation. In a second method, we use a support vector machine (SVM; a machine learning algorithm) binary classifier to predict whether a tentative peptide phosphorylation is true. We show good agreement (85%) between postsearch validation of phosphopeptide/spectrum matches by multiple testing and that from support vector machines. Automatic methods conform very well with manual expert validation in a blinded test. Additionally, the algorithms were tested on the identification of synthetic phosphopeptides. We show that phosphate neutral losses in tandem mass spectra can be used to assess the correctness of phosphopeptide/spectrum matches. An SVM classifier with a radial basis function provided classification accuracy from 95.7% to 96.8% of the positive data set, depending on search algorithm used. Establishing the efficacy of an identification is a necessary step for further postsearch interrogation of the spectra for complete localization of phosphorylation sites. Our current implementation performs validation of phosphoserine/phosphothreonine-containing peptides having one or two phosphorylation sites from data gathered on an ion trap mass spectrometer. The SVM-based algorithm has been implemented in the software package DeBunker. We illustrate the application of the SVM-based software DeBunker on a large phosphorylation data set.
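Searching a database appended with reversed sequences allows the error rate among accepted matches to be estimated from the number of decoy hits. The sketch below shows one common convention (decoy hits divided by target hits above a score threshold); the abstract does not state which estimator the authors used, and the function names, the `DECOY_` prefix, and the `(score, is_decoy)` record format are assumptions for illustration.

```python
def reverse_decoys(proteins: dict[str, str]) -> dict[str, str]:
    """Build decoy entries by reversing each protein sequence."""
    return {"DECOY_" + name: seq[::-1] for name, seq in proteins.items()}


def estimate_fdr(psms: list[tuple[float, bool]], threshold: float) -> float:
    """Estimate the false discovery rate among target matches scoring >= threshold.

    psms: (score, is_decoy) pairs from a search of the combined
    target + decoy database. Uses the decoys / targets convention.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0
```

Raising the threshold until the estimate falls below a chosen level is a typical way to assemble a high-confidence set for later validation steps.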

19.
The prevailing method of analyzing tandem-MS data for protein identification involves the comparison of peptide molecular weight and fragmentation data to theoretically predicted values, based on known protein sequences in databases. This is generally effective since proteins from most species under study are in the database or have sufficient homology to allow significant matching. We have encountered difficulties identifying proteins from the fungal species Alternaria alternata due to significant interspecies protein sequence differences (divergence) and its absence from the database. This common household mold causes asthma and allergy problems, but the genome has not been sequenced. De novo sequencing and error-tolerant methods can facilitate protein identifications in divergent, unsequenced species. But these standard methods can be laborious and only allow single amino acid substitution, respectively. We have developed an alternative approach focusing on database engineering, predicting biologically rational polymorphism using statistically weighted amino acid substitution information held in BLOSUM62. Like other second pass methods, it is based on the initially identified protein. However, this approach allows more control over sequences to be considered, including multiple changes per peptide. The results show considerable improvement for routine protein identification and the potential for rescuing otherwise unconvincing identifications in unusually divergent species.

20.
Intact protein biomarkers from Bacillus cereus T spores have been analyzed by high-resolution tandem Fourier transform ion cyclotron resonance mass spectrometry. Two techniques have been applied for excitation of the isolated multiply charged precursor ion species: sustained off-resonance irradiation/collisionally activated dissociation and electron capture dissociation. Fragmentation-derived sequence tags and BLAST sequence similarity proteome database searches allow unequivocal identification of the major biomarker protein with unprecedented specificity. Sequence-specific fragmentation patterns further confirm protein identification. Moreover, methodology combining accurate mass measurements of intact proteins with additional information contained in a proteome database permits tentative assignment of several other protein biomarkers isolated from the B. cereus T spores. We argue that approaches involving tandem MS of protein biomarkers, combined with bioinformatics, can drastically improve the specificity of individual microorganism identification, particularly in complex environments.
