
A nested mixture model for protein identification using mass spectrometry. (English) Zbl 1194.62118

Summary: Mass spectrometry provides a high-throughput way to identify proteins in biological samples. In a typical experiment, proteins in a sample are first broken into their constituent peptides. The resulting mixture of peptides is then subjected to mass spectrometry, which generates thousands of spectra, each characteristic of its generating peptide. We consider the problem of inferring, from these spectra, which proteins and peptides are present in the sample. We develop a statistical approach to the problem, based on a nested mixture model. In contrast to commonly used two-stage approaches, this model provides a one-stage solution that simultaneously identifies which proteins are present, and which peptides are correctly identified. In this way our model incorporates the evidence feedback between proteins and their constituent peptides. Using simulated data and a yeast data set, we compare and contrast our method with existing widely used approaches (PeptideProphet/ProteinProphet) and with a recently published new approach, HSM. For peptide identification, our single-stage approach yields consistently more accurate results. For protein identification the methods have similar accuracy in most settings, although we exhibit some scenarios in which the existing methods perform poorly.

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology
65C60 Computational problems in statistics (MSC2010)

References:

[1] Blei, D., Griffiths, T., Jordan, M. and Tenenbaum, J. (2004). Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems 18. MIT Press.
[2] Choi, H. and Nesvizhskii, A. I. (2008). Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. J. Proteome Res. 7 254-265.
[3] Coon, J. J., Syka, J. E., Shabanowitz, J. and Hunt, D. (2005). Tandem mass spectrometry for peptide and protein sequence analysis. BioTechniques 38 519-521.
[4] Dempster, A., Laird, N. and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1-38. · Zbl 0364.62022
[5] Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. G. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151-1160. · Zbl 1073.62511 · doi:10.1198/016214501753382129
[6] Elias, J., Faherty, B. and Gygi, S. (2005). Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nature Methods 2 667-675.
[7] Elias, J. and Gygi, S. (2007). Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nature Methods 4 207-214.
[8] Eng, J., McCormack, A. and Yates, J. I. (1994). An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5 976-989.
[9] Feng, J., Naiman, Q. and Cooper, B. (2007). Probability model for assessing protein assembled from peptide sequences inferred from tandem mass spectrometry data. Anal. Chem. 79 3901-3911.
[10] Kall, L., Canterbury, J., Weston, J., Noble, W. S. and MacCoss, M. J. (2007). A semi-supervised machine learning technique for peptide identification from shotgun proteomics datasets. Nature Methods 4 923-925.
[11] Keller, A., Purvine, S., Nesvizhskii, A. I., Stolyar, S., Goodlett, D. R. and Kolker, E. (2002). Experimental protein mixture for validating tandem mass spectral analysis. Omics 6 207-212.
[12] Keller, A., Nesvizhskii, A., Kolker, E. and Aebersold, R. (2002). Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74 5383-5392.
[13] Kinter, M. and Sherman, N. E. (2003). Protein Sequencing and Identification Using Tandem Mass Spectrometry. Wiley, New York.
[14] Li, Q. (2008). Statistical methods for peptide and protein identification in mass spectrometry. Ph.D. thesis, Univ. Washington, Seattle, WA.
[15] Nesvizhskii, A. I. and Aebersold, R. (2004). Analysis, statistical validation and dissemination of large-scale proteomics datasets generated by tandem MS. Drug Discovery Today 9 173-181.
[16] Nesvizhskii, A. I., Keller, A., Kolker, E. and Aebersold, R. (2003). A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 75 4646-4653.
[17] Newton, M. A., Noueiry, A., Sarkar, D. and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5 155-176. · Zbl 1096.62124 · doi:10.1093/biostatistics/5.2.155
[18] Price, T. S., Lucitt, M. B., Wu, W., Austin, D. J., Pizarro, A., Yocum, A. K., Blair, I. A., FitzGerald, G. A. and Grosser, T. (2007). EBP, a program for protein identification using multiple tandem mass spectrometry data sets. Mol. Cell. Proteomics 6 527-536.
[19] Purvine, S., Picone, A. F. and Kolker, E. (2004). Standard mixtures for proteome studies. Omics 8 79-92.
[20] Sadygov, R., Cociorva, D. and Yates, J. (2004). Large-scale database searching using tandem mass spectra: Looking up the answer in the back of the book. Nature Methods 1 195-202.
[21] Sadygov, R., Liu, H. and Yates, J. (2004). Statistical models for protein validation using tandem mass spectral data and protein amino acid sequence databases. Anal. Chem. 76 1664-1671.
[22] Sadygov, R. and Yates, J. (2003). A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases. Anal. Chem. 75 3792-3798.
[23] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464. · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[24] Shen, C., Wang, Z., Shankar, G., Zhang, X. and Li, L. (2008). A hierarchical statistical model to assess the confidence of peptides and proteins inferred from tandem mass spectrometry. Bioinformatics 24 202-208. · Zbl 1254.92006 · doi:10.1007/978-3-540-74891-5
[25] Steen, H. and Mann, M. (2004). The abc’s (and xyz’s) of peptide sequencing. Nature Reviews 5 699-712.
[26] Tabb, D., McDonald, H. and Yates, J. I. (2002). DTASelect and Contrast: Tools for assembling and comparing protein identifications from shotgun proteomics. J. Proteome Res. 1 21-36.
[27] Vermunt, J. K. (2003). Multilevel latent class models. Sociological Methodology 33 213-239.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.