Adaptive compressive learning for prediction of protein-protein interactions from primary sequence. (English) Zbl 1397.92243

Summary: Protein-protein interactions (PPIs) play an important role in biological processes. Although much effort has been devoted to identifying novel PPIs by integrating experimental biological knowledge, many difficulties remain because sufficient protein structural and functional information is lacking. It is therefore highly desirable to develop methods that predict PPIs from amino acid sequences alone. However, sequence-based predictors often struggle with high dimensionality, which causes over-fitting and high computational complexity, as well as with redundancy in the sequence feature vectors. In this paper, a novel computational approach based on compressed sensing theory is proposed to predict yeast Saccharomyces cerevisiae PPIs from primary sequence, and it achieves promising results. The key advantage of the proposed compressed sensing algorithm is that it can compress the original high-dimensional protein sequence feature vector into a much lower-dimensional, more condensed space by taking the sparsity of the original signal into account. Compressed sensing is especially attractive for protein sequence analysis because the signal can be reconstructed from far fewer measurements than traditional Nyquist sampling theory usually considers necessary. Experimental results demonstrate that the proposed compressed sensing method is powerful for analyzing noisy biological data and reducing redundancy in feature vectors. The proposed method represents a new strategy for dealing with high-dimensional discrete protein models and has great potential to be extended to many other complicated biological systems.
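
The compress-then-recover idea described in the summary can be illustrated with a minimal sketch. This is not the paper's algorithm or data: the Gaussian measurement matrix, the dimensions (256 features, 100 measurements, 4 nonzero entries), and all function names are assumptions chosen for illustration, and orthogonal matching pursuit is used here only as one standard greedy recovery method. The sketch projects a sparse high-dimensional vector onto far fewer random measurements and then attempts to reconstruct it.

```python
import numpy as np


def make_sparse_signal(n, k, rng):
    """Return a length-n vector with exactly k nonzero entries (magnitudes in [1, 2])."""
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.uniform(1.0, 2.0, size=k) * rng.choice([-1.0, 1.0], size=k)
    return x


def gaussian_measurement_matrix(m, n, rng):
    """Random Gaussian matrix mapping n dimensions down to m < n measurements."""
    return rng.standard_normal((m, n)) / np.sqrt(m)


def omp(A, y, k):
    """Orthogonal matching pursuit: greedily estimate a k-sparse x from y = A @ x."""
    n = A.shape[1]
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Select the column most correlated with the current residual.
        j = int(np.argmax(np.abs(A.T @ residual)))
        support.append(j)
        # Refit all selected columns by least squares, then update the residual.
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(n)
    x_hat[support] = coef
    return x_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, k = 256, 100, 4  # hypothetical sizes, not taken from the paper
    x = make_sparse_signal(n, k, rng)
    A = gaussian_measurement_matrix(m, n, rng)
    y = A @ x  # compressed representation: m numbers instead of n
    x_hat = omp(A, y, k)
    print("max abs reconstruction error:", np.max(np.abs(x - x_hat)))
```

In a learning setting, the compressed vector `y` rather than the original `x` would be passed to a classifier; the reconstruction step is shown only to illustrate that little information is lost when the signal is sparse.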


92C40 Biochemistry, molecular biology
92D20 Protein sequences, DNA sequences

