pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. (English) Zbl 1343.92153

Summary: As one type of post-translational modification (PTM), protein lysine succinylation plays an important role in regulating a variety of biological processes. It is, however, also implicated in some diseases. Consequently, from the perspectives of both basic research and drug development, we face a challenging problem: for an uncharacterized protein sequence containing many Lys residues, which ones can be succinylated, and which ones cannot? To address this problem, we have developed a predictor called pSuc-Lys by (1) incorporating the sequence-coupled information into the general pseudo amino acid composition (PseAAC), (2) balancing the skewed training dataset by random sampling, and (3) constructing an ensemble predictor by fusing a series of individual random forest classifiers. Rigorous cross-validations indicated that it remarkably outperformed the existing methods. A user-friendly web server for pSuc-Lys has been established at http://www.jci-bioinfo.cn/pSuc-Lys, by which users can easily obtain their desired results without needing to work through the complicated mathematical equations involved. It has not escaped our notice that the formulation and approach presented here can also be used to analyze many other problems in computational proteomics.
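
The three steps named in the summary can be illustrated with a small, hedged sketch. It is not the pSuc-Lys implementation: the window half-width `flank`, the padding symbol `X`, the plain amino acid composition features (a stand-in for the sequence-coupled PseAAC), and the nearest-centroid base learner (a stand-in for a random forest) are all assumptions made only to keep the example short and dependency-free. All function names are hypothetical. What the sketch does show is the overall shape: cut a window around every Lys, draw several balanced training subsets by randomly undersampling the majority class, train one classifier per subset, and fuse the classifiers by majority vote.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def lysine_windows(seq, flank=3, pad="X"):
    """Return (1-based position, window) for every Lys (K) in seq.
    Windows near the termini are padded with a dummy residue (assumption)."""
    padded = pad * flank + seq + pad * flank
    return [(i + 1, padded[i:i + 2 * flank + 1])
            for i, aa in enumerate(seq) if aa == "K"]

def composition(window):
    """Plain amino acid composition; a stand-in for the PseAAC features."""
    counts = Counter(window)
    return [counts[a] / len(window) for a in AMINO_ACIDS]

def balanced_subsets(positives, negatives, n_subsets, seed=0):
    """Pair all positives with an equally sized random draw of negatives.
    Assumes the negatives are the majority class."""
    rng = random.Random(seed)
    k = min(len(positives), len(negatives))
    return [(list(positives), rng.sample(negatives, k))
            for _ in range(n_subsets)]

def _centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def _sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def train_base(positives, negatives):
    """Nearest-centroid classifier; a stand-in for one random forest."""
    c_pos = _centroid([composition(w) for w in positives])
    c_neg = _centroid([composition(w) for w in negatives])
    def predict(window):
        v = composition(window)
        return _sq_dist(v, c_pos) < _sq_dist(v, c_neg)
    return predict

def train_ensemble(positives, negatives, n_subsets=5, seed=0):
    """Train one base classifier per balanced subset."""
    return [train_base(p, n)
            for p, n in balanced_subsets(positives, negatives, n_subsets, seed)]

def predict_ensemble(models, window):
    """Fuse the individual predictions by simple majority vote."""
    votes = sum(1 for m in models if m(window))
    return votes * 2 > len(models)
```

Using an odd number of subsets avoids tied votes. In practice each base learner would be replaced by a proper random forest and the composition features by the sequence-coupled PseAAC described in the paper.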


92C40 Biochemistry, molecular biology
92D20 Protein sequences, DNA sequences
92-04 Software, source code, etc. for problems pertaining to biology

