Bayesian protein structure alignment. (English) Zbl 1454.62387

Summary: The analysis of the three-dimensional structure of proteins is an important topic in molecular biochemistry. Structure plays a critical role in defining the function of proteins and is more strongly conserved than amino acid sequence over evolutionary timescales. A key challenge is the identification and evaluation of structural similarity between proteins; such analysis can aid in understanding the role of newly discovered proteins and help elucidate evolutionary relationships between organisms. Computational biologists have developed many clever algorithmic techniques for comparing protein structures; however, all are based on heuristic optimization criteria, making statistical interpretation somewhat difficult. Here we present a fully probabilistic framework for pairwise structural alignment of proteins. Our approach has several advantages, including the ability to capture alignment uncertainty and to estimate key “gap” parameters which critically affect the quality of the alignment. We show that several existing alignment methods arise as maximum a posteriori estimates under specific choices of prior distributions and error models. Our probabilistic framework is also easily extended to incorporate additional information, which we demonstrate by including primary sequence information to generate simultaneous sequence-structure alignments that can resolve ambiguities obtained using structure alone. This combined model also provides a natural approach for the difficult task of estimating evolutionary distance based on structural alignments. The model is illustrated by comparison with well-established methods on several challenging protein alignment examples.


62P10 Applications of statistics to biology and medical sciences; meta analysis
62F15 Bayesian inference


MAMMOTH; Balibase

