On the entropy of protein families. (English) Zbl 1359.92032

Summary: Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copying, transport, … The protein sequences realizing a given function may vary widely across organisms, giving rise to a protein family. Here, we estimate the entropy of those families using different approaches, including the hidden Markov models used in protein databases and inferred statistical models that reproduce the low-order (1- and 2-point) statistics of multi-sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the mutation of one particular amino acid at a specific site, and relate this notion to the escape probability of the HIV virus. The case of lattice proteins, for which the entropy can be computed exactly, allows us to provide another illustration of the concept of cost, due to the competition between different folds. The relevance of the entropy to directed evolution experiments is stressed.
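
As a rough illustration of the entropy and entropic cost mentioned in the summary, the following minimal sketch uses only the 1-point (single-site) statistics of an alignment. It is not the method of the paper, which relies on hidden Markov models and on inferred models that also reproduce 2-point statistics. The function names and the toy alignment are hypothetical, no pseudocounts or sequence reweighting are applied, and because correlations between sites are neglected, the resulting entropy is only an upper bound on that of a model that also matches pairwise statistics.

```python
import math
from collections import Counter

def site_entropies(alignment):
    """Shannon entropy (in nats) of each column of an alignment.

    `alignment` is a list of equal-length strings (hypothetical toy input).
    """
    n_seq = len(alignment)
    entropies = []
    for column in zip(*alignment):
        counts = Counter(column)
        entropies.append(
            -sum((c / n_seq) * math.log(c / n_seq) for c in counts.values())
        )
    return entropies

def independent_entropy(alignment):
    """Entropy of the independent-site model: the sum of column entropies."""
    return sum(site_entropies(alignment))

def entropic_cost(alignment, site):
    """Entropy lost when `site` is fixed to one amino acid.

    Under the independent-site assumption, this is simply that column's entropy.
    """
    return site_entropies(alignment)[site]

# Toy alignment: columns 0 and 2 are variable, columns 1 and 3 are conserved.
toy_alignment = ["ACDA", "ACEA", "GCDA", "GCEA"]
total = independent_entropy(toy_alignment)
cost = entropic_cost(toy_alignment, 0)
print(f"total entropy = {total:.4f} nats, cost of fixing site 0 = {cost:.4f} nats")
```

In this simplified picture, fixing a conserved site costs no entropy, whereas fixing a variable site removes that site's full contribution. In a model with couplings between sites, the cost would also depend on how the constraint propagates to the other sites.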

MSC:

92C40 Biochemistry, molecular biology
62P10 Applications of statistics to biology and medical sciences; meta analysis
92D20 Protein sequences, DNA sequences
60J22 Computational methods in Markov chains

Software:

Pfam; UniProt
References:

[1] Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, London (1998) · Zbl 0929.92010 · doi:10.1017/CBO9780511790492
[2] Ashkenazy, H., Erez, E., Martz, E., Pupko, T., Ben-Tal, N.: ConSurf 2010: calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucl. Acids Res. 38, W529-W533 (2010) · doi:10.1093/nar/gkq399
[3] Lapedes, A.S., Giraud, B.G., Liu, L., Stormo, G.D.: Correlated mutations in models of protein sequences: phylogenetic and structural effects. Lect. Notes-Monogr. Ser. 33, 236-256 (1999) · doi:10.1214/lnms/1215455556
[4] Rausell, A., Juan, D., Pazos, F., Valencia, A.: Protein interactions and ligand binding: from protein subfamilies to functional specificity. Proc. Natl. Acad. Sci. 107(5), 1995-2000 (2010) · doi:10.1073/pnas.0908044107
[5] Pazos, F., Helmer-Citterich, E., Ausiello, G., Valencia, A.: Correlated mutations contain information about protein-protein interaction. J. Mol. Biol. 271, 511-523 (1997) · doi:10.1006/jmbi.1997.1198
[6] de Juan, D., Pazos, F., Valencia, A.: Emerging methods in protein co-evolution. Nat. Rev. Genet. 14, 249-261 (2013) · doi:10.1038/nrg3414
[7] Berman, H.M., Kleywegt, G.J., Nakamura, H., Markley, J.L.: The protein data bank at 40: reflecting on the past to prepare for the future. Structure 20(3), 391-396 (2012) · doi:10.1016/j.str.2012.01.010
[8] The Uniprot Consortium: Reorganizing the protein space at the Universal Protein Resource (UniProt). Nucl. Acids Res. 40, D71 (2012)
[9] Punta, M., Coggill, P.C., Eberhardt, R.Y., Mistry, J., Tate, J.G., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., Heger, A., Holm, L., Sonnhammer, E.L.L., Eddy, S.R., Bateman, A., Finn, R.D.: The Pfam protein families database. Nucl. Acids Res. 40, D290 (2012) · doi:10.1093/nar/gkr1065
[10] Jaynes, E.T.: On the rationale of maximum-entropy methods. Proc. IEEE 70(9), 939-952 (1982) · doi:10.1109/PROC.1982.12425
[11] Bialek, William: Biophysics: Searching for Principles. Princeton University Press, Princeton (2012)
[12] Weigt, Martin, White, Robert A., Szurmant, Hendrik, Hoch, James A., Hwa, Terence: Identification of direct residue contacts in protein-protein interaction by message passing. Proc. Natl. Acad. Sci. USA 106(1), 67-72 (2009) · doi:10.1073/pnas.0805923106
[13] Burger, L., van Nimwegen, E.: Disentangling direct from indirect co-evolution of residues in protein alignments. PLoS Comput. Biol. 6, e1000633 (2010) · doi:10.1371/journal.pcbi.1000633
[14] Balakrishnan, S., Kamisetty, H., Carbonell, J.G., Lee, S.I., Langmead, C.J.: Learning generative models for protein fold families. Proteins: Struct. Funct. Bioinf. 79, 1061 (2011) · doi:10.1002/prot.22934
[15] Cocco, Simona, Monasson, Rémi: Adaptive cluster expansion for inferring Boltzmann machines with noisy data. Phys. Rev. Lett. 106, 090601 (2011) · Zbl 1100.68576 · doi:10.1103/PhysRevLett.106.090601
[16] Cocco, Simona, Monasson, Rémi: Adaptive cluster expansion for the inverse Ising problem: convergence, algorithm and tests. J. Stat. Phys. 147(2), 252-314 (2012) · Zbl 1243.82018 · doi:10.1007/s10955-012-0463-4
[17] Shakhnovich, E., Gutin, A.: Enumeration of all compact conformations of copolymers with random sequence of links. J. Chem. Phys. 93, 5967-5971 (1990) · doi:10.1063/1.459480
[18] Shakhnovich, E.: Protein design: a perspective from simple tractable models. Fold. Des. 3, R45-R58 (1998) · doi:10.1016/S1359-0278(98)00021-2
[19] Finn, Robert D., Mistry, Jaina, Tate, John, Coggill, Penny, Heger, Andreas, Pollington, Joanne E., Gavin, O. Luke, Gunasekaran, Prasad, Ceric, Goran, Forslund, Kristoffer, Holm, Liisa, Sonnhammer, Erik L.L., Eddy, Sean R., Bateman, Alex: The Pfam protein families database. Nucl. Acids Res. 38(suppl 1), D211-D222 (2010) · doi:10.1093/nar/gkp985
[20] Barton, J.P., Cocco, S., De Leonardis, E., Monasson, R.: Large pseudocounts and L2-norm penalties are necessary for the mean-field inference of Ising and Potts models. Phys. Rev. E 90(1), 012132 (2014) · doi:10.1103/PhysRevE.90.012132
[21] Morcos, F., Pagnani, A., Lunt, B., Bertolino, A., Marks, D.S., Sander, C., Zecchina, R., Onuchic, J.N., Hwa, Terence, Weigt, Martin: Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl. Acad. Sci. USA 108(49), E1293-E1301 (2011) · doi:10.1073/pnas.1111471108
[22] Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M., Aurell, E.: Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys. Rev. E 87, 012707 (2013) · doi:10.1103/PhysRevE.87.012707
[23] Cocco, S., Monasson, R., Weigt, M.: From principal component to direct coupling analysis of coevolution in proteins: low-eigenvalue modes are needed for structure prediction. PLoS Comput. Biol. 9, e1003176 (2013) · doi:10.1371/journal.pcbi.1003176
[24] Russ, W., Lowery, D.M., Mishra, P., Yaffe, M.B., Ranganathan, R.: Natural-like function in artificial WW domains. Nature 437, 579-583 (2005) · doi:10.1038/nature03990
[25] Socolich, Michael, Lockless, Steve W., Russ, William P., Lee, Heather, Gardner, Kevin H., Ranganathan, Rama: Evolutionary information for specifying a protein fold. Nature 437(7058), 512-518 (2005) · doi:10.1038/nature03991
[26] Korber, Bette, Gaschen, Brian, Yusim, Karina, Thakallapally, Rama, Keşmir, Can, Detours, Vincent: Evolutionary and immunological implications of contemporary HIV-1 variation. Br. Med. Bull. 58(1), 19-42 (2001) · doi:10.1093/bmb/58.1.19
[27] Ferguson, Andrew L., Mann, Jaclyn K., Omarjee, Saleha, Ndung’u, Thumbi, Walker, Bruce D., Chakraborty, Arup K.: Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design. Immunity 38(3), 606-617 (2013) · doi:10.1016/j.immuni.2012.11.022
[28] Mann, Jaclyn K., Barton, John P., Ferguson, Andrew L., Omarjee, Saleha, Walker, Bruce D., Chakraborty, Arup K., Ndung’u, Thumbi: The fitness landscape of HIV-1 Gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS Comput. Biol. 10(8), e1003776 (2014) · doi:10.1371/journal.pcbi.1003776
[29] Haq, Omar, Andrec, Michael, Morozov, Alexandre V., Levy, Ronald M.: Correlated electrostatic mutations provide a reservoir of stability in HIV protease. PLoS Comput. Biol. 8(9), e1002675 (2012) · doi:10.1371/journal.pcbi.1002675
[30] Flynn, William F., Chang, Max W., Tan, Zhiqiang, Oliveira, Glenn, Yuan, Jinyun, Okulicz, Jason F., Torbett, Bruce E., Levy, Ronald M.: Deep sequencing of protease inhibitor resistant HIV patient isolates reveals patterns of correlated mutations in gag and protease. PLoS Comput. Biol. 11(4), e1004249 (2015) · doi:10.1371/journal.pcbi.1004249
[31] Shekhar, K., Ruberman, C.F., Ferguson, A.L., Barton, J.P., Kardar, M., Chakraborty, A.K.: Spin models inferred from patient-derived viral sequence data faithfully describe HIV fitness landscapes. Phys. Rev. E 88(6), 062705 (2013) · doi:10.1103/PhysRevE.88.062705
[32] Addo, M.M., Yu, X.G., Rathod, A., Eldridge, R.L., Strick, D., Johnston, M.N., Corcoran, C., Fitzpatrick, C.A., Feeney, M.E., Rodriguez, W.R., Basgoz, N., Draenert, R., Brander, C., Goulder, P.J.R., Rosenberg, E.S., Altfeld, Marcus, Walker, Bruce D.: Comprehensive epitope analysis of human immunodeficiency virus type 1 (HIV-1)-specific T-cell responses directed against the entire expressed HIV-1 genome demonstrate broadly directed responses, but no correlation to viral load. J. Virol. 77(3), 2081-2092 (2003) · doi:10.1128/JVI.77.3.2081-2092.2003
[33] Streeck, H., Jolin, J.S., Qi, Ying, Yassine-Diab, B., Johnson, R.C., Kwon, D.S., Addo, M.M., Brumme, C., Routy, J.P., Little, S., Jessen, H.K., Kelleher, A.D., Hecht, F.M., Sekaly, R.P., Rosenberg, E.S., Walker, Bruce D., Carrington, Mary, Altfeld, Marcus: Human immunodeficiency virus type 1-specific CD8+ T-cell responses during primary infection are major determinants of the viral set point and loss of CD4+ T cells. J. Virol. 83(15), 7641-7648 (2009) · doi:10.1128/JVI.00182-09
[34] Zhao, Gongpu, Perilla, Juan R., Yufenyuy, Ernest L., Meng, Xin, Chen, Bo, Ning, Jiying, Ahn, Jinwoo, Gronenborn, Angela M., Schulten, Klaus, Aiken, Christopher, et al.: Mature HIV-1 capsid structure by cryo-electron microscopy and all-atom molecular dynamics. Nature 497(7451), 643-646 (2013) · doi:10.1038/nature12162
[35] Dahirel, V., Shekhar, K., Pereyra, F., Miura, T., Artyomov, M., Talsania, S., Allen, T.M., Altfeld, M., Carrington, M., Irvine, D.J., Walker, B.D., Chakraborty, A.K.: Coordinate linkage of HIV evolution reveals regions of immunological vulnerability. Proc. Natl. Acad. Sci. 108(28), 11530-11535 (2011) · doi:10.1073/pnas.1105315108
[36] Barton, John P., Kardar, Mehran, Chakraborty, Arup K.: Scaling laws describe memories of host pathogen riposte in the HIV population. Proc. Natl. Acad. Sci. 112(7), 1965-1970 (2015) · doi:10.1073/pnas.1415386112
[37] Beitzel, B.F., Bakken, R.R., Smith, J.M., Schmaljohn, C.S.: High-resolution functional mapping of the Venezuelan equine encephalitis virus genome by insertional mutagenesis and massively parallel sequencing. PLoS Pathog. 6(10), e1001146 (2010) · doi:10.1371/journal.ppat.1001146
[38] Heaton, Nicholas S., Sachs, David, Chen, Chi-Jene, Hai, Rong, Palese, Peter: Genome-wide mutagenesis of influenza virus reveals unique plasticity of the hemagglutinin and NS1 proteins. Proc. Natl. Acad. Sci. 110(50), 20248-20253 (2013) · doi:10.1073/pnas.1320524110
[39] Remenyi, R., Qi, H., Su, S.Y., Chen, Z., Wu, N.C., Arumugaswami, V., Truong, S., Chu, V., Stokelman, T., Lo, H.H., Olson, A., Wu, T.T., Chen, S.H., Lin, C.Y., Sun, R.: A comprehensive functional map of the hepatitis C virus genome provides a resource for probing viral proteins. mBio 5, e01469-14 (2014) · doi:10.1128/mBio.01469-14
[40] Fulton, B.O., Sachs, D., Beaty, S.M., Won, S.T., Lee, B., Palese, P., Heaton, N.S.: Mutational analysis of measles virus suggests constraints on antigenic variation of the glycoproteins. Cell Rep. 11(9), 1331-1338 (2015) · doi:10.1016/j.celrep.2015.04.054
[41] Ferrari, Guido, Korber, Bette, Goonetilleke, Nilu, Liu, Michael K.P., Turnbull, Emma L., Salazar-Gonzalez, Jesus F., Hawkins, Natalie, Self, Steve, Watson, Sydeaka, Betts, Michael R., Gay, Cynthia, McGhee, Cynthia, Pellegrino, Pierre, Williams, Ian, Tomaras, Georgia D., Haynes, Barton F., Gray, Clive M., Borrow, Persephone, Roederer, Mario, McMichael, Andrew J., Weinhold, Kent J.: Relationship between functional profile of HIV-1 specific CD8 T cells and epitope variability with the selection of escape mutants in acute HIV-1 infection. PLoS Pathog. 7(2), e1001273 (2011) · doi:10.1371/journal.ppat.1001273
[42] Liu, M.K.P., Hawkins, N., Ritchie, A.J., Ganusov, V.V., Whale, V., Brackenridge, S., Li, H., Pavlicek, J.W., Cai, F., Rose-Abrahams, M., Treurnicht, F., Hraber, P., Riou, C., Gray, C., Ferrari, G., Tanner, R., Ping, L.H., Anderson, J.A., Swanstrom, R., Cohen, M., Abdool Karim, S.S., Haynes, B., Borrow, P., Perelson, A.S., Shaw, G.M., Hahn, B.H., Williamson, C., Korber, B.T., Gao, F., Self, S., McMichael, A., Goonetilleke, N.: Vertical T cell immunodominance and epitope entropy determine HIV-1 escape. J. Clin. Investig. 123(1), 380-393 (2013)
[43] Li, H., Helling, R., Tang, C., Wingreen, N.: Emergence of preferred structures in a simple model of protein folding. Science 273, 666-669 (1996) · doi:10.1126/science.273.5275.666
[44] Li, H., Tang, C., Wingreen, N.: Designability of protein structures: a lattice-model study using the Miyazawa-Jernigan matrix. Proteins 49, 403-412 (2002) · doi:10.1002/prot.10239
[45] England, Jeremy L., Shakhnovich, Eugene I.: Structural determinant of protein designability. Phys. Rev. Lett. 90, 218101 (2003) · doi:10.1103/PhysRevLett.90.218101
[46] Miyazawa, S., Jernigan, R.: Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules 18, 534 (1985) · doi:10.1021/ma00145a039
[47] Jacquin, H., Gilson, A., Shakhnovich, E., Cocco, S., Monasson, R.: Benchmarking inverse statistical approaches for protein structure and design with exactly solvable models. Available on bioRxiv (2015) · doi:10.1101/028936
[48] Berezovsky, I.N., Zeldovich, K.B., Shakhnovich, E.: Positive and negative design in stability and thermal adaptation of natural proteins. PLoS Comput. Biol. 3(3), e52 (2007) · doi:10.1371/journal.pcbi.0030052
[49] Keefe, Anthony D., Szostak, Jack W.: Functional proteins from a random-sequence library. Nature 410(6829), 715-718 (2001) · doi:10.1038/35070613
[50] Greenbaum, B., Cocco, S., Levine, A., Monasson, R.: A quantitative theory of entropic forces acting on constrained nucleotide sequences applied to viruses. Proc. Natl. Acad. Sci. USA 111, 5054-5059 (2014) · doi:10.1073/pnas.1402285111
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.