Elman RNN based classification of proteins sequences on account of their mutual information. (English) Zbl 1337.92161

Summary: In the present work we estimate residue correlation within protein sequences using the mutual information (MI) of adjacent residues, based on structural and solvent accessibility properties of amino acids. The handling of long-range correlation between nonadjacent residues is improved by constructing a mutual information vector (MIV) for each protein sequence, so that each sequence is associated with its corresponding MIVs. These MIVs are given to an Elman RNN to classify the protein sequences. The modeling power of MIV was shown to be significantly better, giving a new approach to alignment-free classification of protein sequences. We also conclude that MIVs based on sequence structural and solvent accessibility properties are better predictors.
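
The MIV construction described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plug-in (frequency-count) MI estimate in bits, takes entry k of the MIV to be the MI between residues separated by k intervening positions, and uses a vector length of 20. The function names `mutual_information` and `mutual_information_vector` and the parameter `max_gap` are hypothetical. The summary's encoding of residues by structural and solvent accessibility properties is not reproduced; the sketch works on whatever symbols the input string contains.

```python
import math
from collections import Counter


def mutual_information(seq, gap=0):
    """Plug-in MI estimate (bits) between symbols with `gap` positions between them.

    gap=0 pairs adjacent residues; this gap convention is an assumption.
    """
    d = gap + 1
    pairs = [(seq[i], seq[i + d]) for i in range(len(seq) - d)]
    if not pairs:
        return 0.0  # sequence too short for this gap
    n = len(pairs)
    joint = Counter(pairs)                 # counts of (left, right) pairs
    left = Counter(a for a, _ in pairs)    # marginal counts, first position
    right = Counter(b for _, b in pairs)   # marginal counts, second position
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) written with counts: c * n / (left[a] * right[b])
        mi += p_ab * math.log2(c * n / (left[a] * right[b]))
    return mi


def mutual_information_vector(seq, max_gap=19):
    """MIV sketch: MI at gaps 0..max_gap (the length 20 is an assumed default)."""
    return [mutual_information(seq, g) for g in range(max_gap + 1)]
```

A call such as `mutual_information_vector(sequence)` returns a fixed-length feature vector for a sequence of any length, which is the property that lets such vectors be fed to a classifier without aligning the sequences first.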


92D20 Protein sequences, DNA sequences
94A17 Measures of information, entropy

