zbMATH — the first resource for mathematics

Identifying anticancer peptides by using a generalized chaos game representation. (English) Zbl 1410.92083
Summary: We generalize the chaos game representation (CGR) to higher-dimensional spaces while preserving its bijectivity, keeping the method sufficiently representative and mathematically rigorous compared to previous attempts. We first state and prove the asymptotic property of CGR and of our generalized chaos game representation (GCGR) method. It follows that the dissimilarity between sequences that share an identical subsequence at different positions decreases exponentially with the length of that subsequence; this effect had gone unnoticed by researchers. By making it explicit, we show that the effect fundamentally supports (G)CGR as a similarity measure or feature extraction technique. We develop two feature extraction techniques: GCGR-Centroid and GCGR-Variance. We use GCGR-Centroid to analyze the similarity between protein sequences on three datasets: 9 ND5, 24 TF, and 50 beta-globin proteins. The results are consistent with previous studies, which demonstrates the significance of the method. Finally, using support vector machines, we train an anticancer peptide prediction model on both GCGR-Centroid and GCGR-Variance features, and achieve significantly higher prediction performance on three well-studied anticancer peptide datasets.
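The exponential effect described in the summary can be illustrated with a small sketch. The record does not give the authors' GCGR construction, so the code below is an assumption: it uses the classic midpoint chaos game rule with each of the 20 amino acids placed at a one-hot vertex in R^20, and plain per-coordinate mean and variance as stand-ins for the centroid and variance features. All function names and the example peptides are hypothetical.

```python
# Illustrative sketch only: classic midpoint chaos game in R^20.
# The vertex placement and feature definitions are assumptions, not the paper's GCGR.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def vertex(residue):
    """Return the assumed one-hot vertex for a residue."""
    idx = AMINO_ACIDS.index(residue)
    return [1.0 if i == idx else 0.0 for i in range(len(AMINO_ACIDS))]


def cgr_points(sequence):
    """Iterate x_i = (x_{i-1} + v(s_i)) / 2, starting from the simplex center."""
    n = len(AMINO_ACIDS)
    point = [1.0 / n] * n
    points = []
    for residue in sequence:
        point = [(p + v) / 2.0 for p, v in zip(point, vertex(residue))]
        points.append(point)
    return points


def centroid(points):
    """Per-coordinate mean of the trajectory (assumed centroid feature)."""
    return [sum(col) / len(points) for col in zip(*points)]


def variance(points):
    """Per-coordinate population variance of the trajectory (assumed variance feature)."""
    center = centroid(points)
    return [
        sum((x - c) ** 2 for x in col) / len(points)
        for col, c in zip(zip(*points), center)
    ]


def euclidean(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Two made-up peptides with different prefixes and the same 7-residue suffix.
shared = "KLAKLAK"
prefix_a, prefix_b = "GIGAVLKV", "FLPW"

before = euclidean(cgr_points(prefix_a)[-1], cgr_points(prefix_b)[-1])
after = euclidean(cgr_points(prefix_a + shared)[-1], cgr_points(prefix_b + shared)[-1])

# Each shared residue halves the gap between the two trajectories.
print(after / before, 2.0 ** -len(shared))
```

Under the midpoint rule, every residue the two sequences share moves both points halfway toward the same vertex, so the distance between them halves at each step. After a shared run of length k the gap is 2^(-k) times what it was before the run, regardless of where the run sits in each sequence.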
92D20 Protein sequences, DNA sequences
92-08 Computational methods for problems pertaining to biology
91A80 Applications of game theory