Prediction of protein structure classes by incorporating different protein descriptors into general Chou’s pseudo amino acid composition. (English) Zbl 1343.92387

Summary: Successful protein structure identification enables researchers to estimate the biological functions of proteins, yet it remains a challenging problem. The most common method for determining an unknown protein's structural class is to perform expensive and time-consuming manual experiments. Because of the availability of amino acid sequences generated in the post-genomic age, it is possible to predict an unknown protein's structural class using machine learning methods, given the protein's amino-acid sequence and/or its secondary structural elements. Following recent research in this area, we propose a new machine learning system that combines several protein descriptors extracted from different protein representations, such as the position-specific scoring matrix (PSSM), the amino-acid sequence, and secondary structural sequences. The prediction engine of our system is an ensemble of support vector machines (SVMs), where each SVM is trained on a different descriptor. The results of the individual SVMs are combined by the sum rule. Our final ensemble produces a success rate that is substantially better than previously reported results on three well-established datasets. The MATLAB code and datasets used in our experiments are freely available for future comparison at http://www.dei.unipd.it/node/2357.
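The sum-rule fusion mentioned in the summary can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the class labels, the function name `sum_rule`, and the score values are all hypothetical, and the sketch assumes each SVM has already produced one comparable score per class (for example, a normalized confidence).

```python
from typing import Dict, List

# Illustrative class labels; the summary does not list the classes used.
CLASSES = ["all-alpha", "all-beta", "alpha/beta", "alpha+beta"]


def sum_rule(scores_per_classifier: List[Dict[str, float]]) -> str:
    """Fuse per-class scores from several classifiers by summing them
    and return the class with the largest total."""
    totals = {c: 0.0 for c in CLASSES}
    for scores in scores_per_classifier:
        for c in CLASSES:
            totals[c] += scores[c]
    return max(CLASSES, key=lambda c: totals[c])


# Hypothetical scores from three SVMs, each assumed to be trained on a
# different descriptor (PSSM, amino-acid sequence, secondary structure).
pssm_svm = {"all-alpha": 0.6, "all-beta": 0.2, "alpha/beta": 0.1, "alpha+beta": 0.1}
seq_svm = {"all-alpha": 0.3, "all-beta": 0.4, "alpha/beta": 0.2, "alpha+beta": 0.1}
ss_svm = {"all-alpha": 0.1, "all-beta": 0.6, "alpha/beta": 0.2, "alpha+beta": 0.1}

predicted = sum_rule([pssm_svm, seq_svm, ss_svm])
print(predicted)
```

In this toy case the PSSM-based classifier alone would choose a different class than the fused decision, which is the situation where combining descriptors can help. Summing raw scores is only meaningful when the classifiers' outputs are on a comparable scale.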


92D20 Protein sequences, DNA sequences
92-08 Computational methods for problems pertaining to biology

