zbMATH — the first resource for mathematics

Comprehensive comparative analysis and identification of RNA-binding protein domains: multi-class classification and feature selection. (English) Zbl 1337.92062
Summary: RNA-protein interactions play an important role in various cellular processes, such as protein synthesis, gene regulation, post-transcriptional gene regulation, alternative splicing, and infection by RNA viruses. In this study, an automatic procedure based on the Gene Ontology Annotation (GOA) and Structural Classification of Proteins (SCOP) databases was designed to capture structurally solved RNA-binding protein domains in different subclasses. Subsequently, we applied a tuned multi-class SVM (TMCSVM), random forest (RF), and multi-class \(\ell_1/\ell_q\)-regularized logistic regression (MCRLR) to analyze and classify RNA-binding protein domains based on a comprehensive set of sequence and structural features, and compared the prediction accuracy of these three state-of-the-art methods. In our results, TMCSVM outperforms the other methods, which suggests its potential as a useful tool for the multi-class prediction of RNA-binding protein domains. MCRLR, on the other hand, elucidates the contribution of individual features to the predictive accuracy for RNA-binding protein domain subclasses, and thereby provides some biological insight into the roles of sequence and structure in protein-RNA interactions.
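The feature-selection role of the \(\ell_1/\ell_q\) penalty can be illustrated with a minimal sketch. It assumes the usual mixed-norm form, in which the weights of one feature across all subclasses form a group, the penalty sums the \(\ell_q\) norms of these groups, and a feature whose group norm is zero is effectively dropped. The summary does not state the value of \(q\), the solver, or the features used, so `q = 2`, the weight matrix `W`, and the feature names below are hypothetical.

```python
from typing import List, Sequence, Tuple


def group_norm(row: Sequence[float], q: float = 2.0) -> float:
    """l_q norm of one feature's weights across all classes."""
    if q == float("inf"):
        return max(abs(w) for w in row)
    return sum(abs(w) ** q for w in row) ** (1.0 / q)


def mixed_norm(weights: List[List[float]], q: float = 2.0) -> float:
    """l_1/l_q penalty: sum over features of each row's l_q norm."""
    return sum(group_norm(row, q) for row in weights)


def rank_features(
    weights: List[List[float]], names: List[str], q: float = 2.0
) -> List[Tuple[str, float]]:
    """Order features by group norm, largest (most influential) first."""
    scores = [(name, group_norm(row, q)) for name, row in zip(names, weights)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical fitted weights: 3 features (rows) x 3 subclasses (columns).
W = [
    [3.0, 4.0, 0.0],  # used by two subclasses
    [0.0, 0.0, 0.0],  # zeroed out by the penalty, i.e. not selected
    [1.0, 0.0, 0.0],  # used by one subclass
]
names = ["hydrophobicity", "helix_content", "net_charge"]  # illustrative only

penalty = mixed_norm(W)
ranking = rank_features(W, names)
selected = [name for name, score in ranking if score > 0.0]
```

Ranking by group norm is only one way to read feature importance from such a model; the paper itself may use a different criterion.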
MSC:
92C40 Biochemistry, molecular biology
92D20 Protein sequences, DNA sequences