Machine learning approaches for discrimination of extracellular matrix proteins using hybrid feature space. (English) Zbl 1343.92007

Summary: Extracellular matrix (ECM) proteins are a vital class of proteins secreted by resident cells. ECM proteins perform several significant functions, including adhesion, differentiation, cell migration and proliferation. In addition, ECM proteins regulate angiogenesis, embryonic development, tumor growth and gene expression. Owing to the tremendous biological significance of ECM proteins and the rapid increase of protein sequences in databases, it is indispensable to introduce a new high-throughput computational model that can accurately identify ECM proteins. Various traditional models have been developed, but they are laborious and tedious. In this work, an effective, high-throughput computational classification model is proposed for the discrimination of ECM proteins. In this model, protein sequences are formulated using amino acid composition, pseudo amino acid composition (PseAAC) and di-peptide composition (DPC) techniques. Further, various combinations of feature extraction techniques are fused to form hybrid feature spaces. Several classifiers were employed. Among these classifiers, K-Nearest Neighbor obtained outstanding performance in combination with the hybrid feature space of PseAAC and DPC. The obtained accuracy of the proposed model is 96.76%, which is the highest success rate reported in the literature so far.
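
The feature-fusion idea in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: it fuses amino acid composition with di-peptide composition (PseAAC is left out because it needs physicochemical property tables that the summary does not give), and it uses a plain Euclidean nearest-neighbour vote. The function names, the toy sequences and their labels are all hypothetical, and sequences are assumed to contain at least two residues drawn from the 20 standard amino acids.

```python
from collections import Counter
from itertools import product
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs of standard residues, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def aac(seq):
    """Amino acid composition: 20 relative residue frequencies."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Di-peptide composition: 400 relative frequencies of adjacent pairs."""
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [pairs[d] / total for d in DIPEPTIDES]


def hybrid(seq):
    """Hybrid feature space: simple concatenation of AAC and DPC (420 values)."""
    return aac(seq) + dpc(seq)


def knn_predict(train, query, k=1):
    """Majority vote among the k training vectors closest to the query.

    train is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up sequences used only to exercise the functions.
train = [
    (hybrid("GPPGPPGPP"), "ECM"),
    (hybrid("MKKLLKKLL"), "non-ECM"),
]
prediction = knn_predict(train, hybrid("GPPGPAGPP"), k=1)
```

Concatenation is the simplest way to fuse feature spaces; in practice the component features may need scaling so that the 400 di-peptide values do not dominate the distance.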


92B15 General biostatistics
62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology

