zbMATH — the first resource for mathematics

Dforml(KNN)-PseAAC: detecting formylation sites from protein sequences using K-nearest neighbor algorithm via Chou’s 5-step rule and pseudo components. (English) Zbl 1411.92107
Summary: Formylation is a type of post-translational modification that can occur on lysine sites and plays an irreplaceable role in organisms. To better understand its mechanism, formylation sites in proteins must be identified accurately. Computational methods are popular because they are more convenient and faster than traditional experimental methods. However, no computational method has yet been proposed for predicting lysine formylation. In this study, we developed a predictor named LFPred that identifies lysine formylation sites using sequence features (amino acid composition (AAC), binary profile features (BPF), and amino acid index (AAI)) combined with the K-nearest neighbor algorithm as the classifier. Based on information entropy, we chose a discrete window instead of a continuous window. In addition, we took measures to select more reliable negative samples and to address the severe imbalance between positive and negative samples. Finally, LFPred achieved a specificity of 79.9% and a sensitivity of 81.4% under the jackknife test, which indicates that our method can be a useful tool for predicting lysine formylation sites.
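The pipeline outlined in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy peptides, the squared Euclidean distance, and the choice k = 3 are all assumptions made for illustration, and the AAI features are omitted because the index values come from an external database that is not reproduced here. The sketch computes AAC and BPF encodings, a per-position Shannon entropy of the kind that could guide window selection, and a leave-one-out (jackknife) estimate of sensitivity and specificity for a K-nearest neighbor classifier.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(peptide):
    """Amino acid composition: relative frequency of each standard residue."""
    counts = Counter(peptide)
    return [counts[a] / len(peptide) for a in AMINO_ACIDS]


def bpf(peptide):
    """Binary profile features: a 20-dim one-hot vector per position, concatenated."""
    vec = []
    for residue in peptide:
        vec.extend(1.0 if residue == a else 0.0 for a in AMINO_ACIDS)
    return vec


def encode(peptide):
    """Hypothetical combined encoding (AAC + BPF only; AAI is left out here)."""
    return aac(peptide) + bpf(peptide)


def position_entropy(peptides):
    """Shannon entropy (bits) of the residue distribution at each window position."""
    entropies = []
    for column in zip(*peptides):
        counts = Counter(column)
        total = len(column)
        entropies.append(
            sum(-(c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies


def knn_predict(train, query, k=3):
    """Majority vote among the k nearest training vectors (squared Euclidean)."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def jackknife(samples, k=3):
    """Leave-one-out test; returns (sensitivity, specificity).

    Assumes the sample list contains at least one positive and one negative.
    """
    encoded = [(encode(p), y) for p, y in samples]
    tp = tn = fp = fn = 0
    for i, (x, y) in enumerate(encoded):
        rest = encoded[:i] + encoded[i + 1:]
        pred = knn_predict(rest, x, k)
        if y == 1:
            tp += pred == 1
            fn += pred == 0
        else:
            tn += pred == 0
            fp += pred == 1
    return tp / (tp + fn), tn / (tn + fp)


# Invented toy windows centred on lysine (K); 1 = formylated, 0 = not.
TOY_SAMPLES = [
    ("GGKGG", 1), ("GAKGG", 1), ("GGKAG", 1),
    ("LLKLL", 0), ("LVKLL", 0), ("LLKVL", 0),
]
```

Calling `jackknife(TOY_SAMPLES, k=3)` returns the pair (sensitivity, specificity) for the toy data; on real data the negative set would first need the filtering and balancing steps that the summary mentions but does not specify.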
MSC:
92C40 Biochemistry, molecular biology
92D20 Protein sequences, DNA sequences
References:
[1] Hussain, W.; Khan, Y. D.; Rasool, N.; Khan, S. A., SPrenylC-PseAAC: a sequence-based model developed via Chou’s 5-steps rule and general PseAAC for identifying S-prenylation sites in proteins, J. Theor. Biol., 468, 1-11, (2019) · Zbl 1411.92233
[2] Behbahani, M.; Mohabatkar, H.; Nosrati, M., Analysis and comparison of lignin peroxidases between fungi and bacteria using three different modes of Chou’s general pseudo amino acid composition, J. Theor. Biol., 411, 1-5, (2016)
[3] Cao, D. S.; Xu, Q. S.; Liang, Y. Z., Propy: a tool to generate various modes of Chou’s PseAAC, Bioinformatics, 29, 960-962, (2013)
[4] Chen, W.; Feng, P.; Yang, H.; Ding, H.; Lin, H., iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences, Oncotarget, 8, 4208-4217, (2017)
[5] Chen, W.; Feng, P. M.; Lin, H., iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition, Nucleic Acids Res., 41, e68, (2013)
[6] Chen, W.; Lei, T. Y.; Jin, D. C.; Lin, H., PseKNC: a flexible web-server for generating pseudo K-tuple nucleotide composition, Anal. Biochem., 456, 53-60, (2014)
[7] Chen, W.; Lin, H., Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences, Mol. BioSyst., 11, 2620-2634, (2015)
[8] Chen, W.; Lin, H.; Feng, P. M.; Ding, C.; Zuo, Y. C., iNuc-PhysChem: a Sequence-based predictor for identifying nucleosomes via physicochemical properties, PLoS One, 7, e47843, (2012)
[9] Chen, X.; Qiu, J. D.; Shi, S. P., Incorporating key position and amino acid residue features to identify general and species-specific Ubiquitin conjugation sites, Bioinformatics, 29, 13, 1614-1622, (2013)
[10] Chen, Z.; Chen, Y. Z.; Wang, X. F., Prediction of ubiquitination sites by using the composition of k-spaced amino acid pairs, PLoS One, 6, 7, e22930, (2011)
[11] Chen, Z.; Zhou, Y.; Song, J., hCKSAAP_UbSite: improved prediction of human ubiquitination sites by exploiting amino acid pattern and properties, Biochimica Et Biophysica Acta, 1834, 8, 1461-1467, (2013)
[12] Cheng, X.; Lin, W. Z.; Xiao, X., pLoc_bal-mAnimal: predict subcellular localization of animal proteins by balancing training dataset and PseAAC, Bioinformatics, (2018)
[13] Cheng, X.; Xiao, X., pLoc-mVirus: predict subcellular localization of multi-location virus proteins via incorporating the optimal GO information into general PseAAC, Gene, 628, 315-321, (2017), (Erratum: ibid., 2018, Vol.644, 156-156)
[14] Cheng, X.; Xiao, X., pLoc-mPlant: predict subcellular localization of multi-location plant proteins via incorporating the optimal GO information into general PseAAC, Mol. BioSyst., 13, 1722-1727, (2017)
[15] Cheng, X.; Xiao, X., pLoc-mHum: predict subcellular localization of multi-location human proteins via general PseAAC to winnow out the crucial GO information, Bioinformatics, 34, 1448-1456, (2018)
[16] Cheng, X.; Zhao, S. G.; Lin, W. Z.; Xiao, X., pLoc-mAnimal: predict subcellular localization of animal proteins with both single and multiple sites, Bioinformatics, 33, 3524-3531, (2017)
[17] Cheng, X.; Zhao, S. G.; Xiao, X., iATC-mHyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals, Oncotarget, 8, 58494-58503, (2017)
[18] Cheng, X.; Zhao, S. G.; Xiao, X., iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals, Bioinformatics, 33, 341-346, (2017), (Corrigendum, ibid., 2017, Vol.33, 2610)
[19] Chou, K. C., Prediction of signal peptides using scaled window, Peptides, 22, 12, 1973-1979, (2001)
[20] Chou, K. C., Using subsite coupling to predict signal peptides, Protein Eng., 14, 75-79, (2001)
[21] Chou, K. C., Prediction of protein signal sequences and their cleavage sites, Proteins, 42, 136-139, (2001)
[22] Chou, K. C., Prediction of protein cellular attributes using pseudo amino acid composition, Proteins, 43, 246-255, (2001), (Erratum: ibid., 2001, Vol.44, 60)
[23] Chou, K. C., Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes, Bioinformatics, 21, 10-19, (2005)
[24] Chou, K. C., Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology, Curr. Proteom., 6, 262-274, (2009)
[25] Chou, K. C., Graphic rule for drug metabolism systems, Curr. Drug Metab., 11, 369-378, (2010)
[26] Chou, K. C., Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review), J. Theor. Biol., 273, 236-247, (2011) · Zbl 1405.92212
[27] Chou, K. C., Some remarks on predicting multi-label attributes in molecular biosystems, Mol. BioSyst., 9, 1092-1100, (2013)
[28] Chou, K. C., Impacts of bioinformatics to medicinal chemistry, Med. Chem., 11, 218-234, (2015)
[29] Chou, K. C., An unprecedented revolution in medicinal chemistry driven by the progress of biological science, Curr. Topics Med. Chem., 17, 2337-2358, (2017)
[30] Chou, K. C.; Shen, H. B., Recent advances in developing web-servers for predicting protein attributes, Nat. Sci., 1, 63-92, (2009)
[31] Consortium, U. P., The universal protein resource (UniProt), Nucleic Acids Res., 33, 1, D154-D159, (2005)
[32] Dehzangi, A.; Heffernan, R.; Sharma, A.; Lyons, J.; Paliwal, K.; Sattar, A., Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou’s general PseAAC, J. Theor. Biol., 364, 284-294, (2015) · Zbl 1405.92092
[33] Deng, W.; Wang, C.; Zhang, Y., GPS-PAIL: prediction of lysine acetyltransferase-specific modification sites from protein sequences, Sci. Rep., 6, 39787, (2016)
[34] Du, P.; Gu, S.; Jiao, Y., PseAAC-General: fast building various modes of general form of Chou’s pseudo amino acid composition for large-scale protein datasets, Int. J. Mol. Sci., 15, 3495-3506, (2014)
[35] Du, P.; Wang, X.; Xu, C.; Gao, Y., PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou’s pseudo amino acid compositions, Anal. Biochem., 425, 117-119, (2012)
[36] Feng, P.; Ding, H.; Yang, H.; Chen, W.; Lin, H., iRNA-PseColl: identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into PseKNC, Mol. Ther. Nucleic Acids, 7, 155-163, (2017)
[37] Fu, L.; Niu, B.; Zhu, Z., CD-HIT: accelerated for clustering the next-generation sequencing data, Bioinformatics, 28, 23, 3150-3152, (2012)
[38] Guodong, C.; Man, C.; Kun, L., ProAcePred: prokaryote lysine acetylation sites prediction based on elastic net feature optimization, Bioinformatics, 34, 3999-4006, (2018)
[39] Hasan, M. M.; Khatun, M. S.; Mollah, M. N.H., A systematic identification of species-specific protein succinylation sites using joint element features information, Int. J. Nanomed., 12, 6303-6315, (2017)
[40] Hou, T.; Zheng, G.; Zhang, P., LAceP: lysine acetylation site prediction using logistic regression classifiers, PLoS One, 9, 2, e89575, (2014)
[41] Hu, L.; Li, Z.; Wang, K.; Niu, S.; Shi, X.; Cai, Y.; Li, H., Prediction and analysis of protein methylarginine and methyllysine based on multisequence features, Biopolymers, 95, 11, 763-771, (2011)
[42] Hussain, W.; Khan, Y. D.; Rasool, N.; Khan, S. A., SPalmitoylC-PseAAC: a sequence-based model developed via Chou’s 5-steps rule and general PseAAC for identifying S-palmitoylation sites in proteins, Anal. Biochem., 568, 14-23, (2019)
[43] Ijaz, A., SUMOhunt: combining spatial staging between lysine and SUMO with random forests to predict SUMOylation, ISRN Bioinform., 2013, (2013)
[44] Jia, J.; Liu, Z.; Xiao, X., pSuc-Lys: predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach, J. Theor. Biol., 394, 223-230, (2016) · Zbl 1343.92153
[45] Jia, J.; Liu, Z.; Xiao, X., iSuc-PseOpt: identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training data set, Anal. Biochem., 497, 48-56, (2016)
[46] Jia, J.; Liu, Z.; Xiao, X.; Liu, B., iCar-PseCp: identify carbonylation sites in proteins by Monte Carlo sampling and incorporating sequence coupled effects into general PseAAC, Oncotarget, 7, 34558-34570, (2016)
[47] Jia, J.; Zhang, L.; Liu, Z.; Xiao, X., pSumo-CD: predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating sequence-coupled effects into general PseAAC, Bioinformatics, 32, 3133-3141, (2016)
[48] Ju, Z.; Cao, J. Z., Prediction of protein N-formylation using the composition of k-spaced amino acid pairs, Anal. Biochem., 534, 40-45, (2017)
[49] Kawashima, S., AAindex: amino acid index database, progress report 2008, Nucleic Acids Res., 36, D202-D205, (2007)
[50] Lee, T. Y.; Huang, H. D.; Hung, J. H., dbPTM: an information repository of protein post-translational modification, Nucleic Acids Res., 34, D622-D627, (2006), Database issue
[51] Lee, T. Y.; Chang, C. W.; Lu, C. T., Identification and characterization of lysine-methylated sites on histones and nonhistone proteins, Comput. Biol. Chem., 50, 11-18, (2014)
[52] Le-Le, H.; Zhen, L.; Wang, K., Prediction and analysis of protein methylarginine and methyllysine based on Multisequence features, Biopolymers, 95, 11, 763-771, (2011)
[53] Li, A.; Xue, Y.; Jin, C., Prediction of Nε-acetylation on internal lysines implemented in Bayesian discriminant method, Biochem. Biophys. Res. Commun., 350, 818-824, (2006)
[54] Li, S.; Li, H.; Li, M., Improved prediction of lysine acetylation by support vector machines, Protein Pept. Lett., 16, 8, 977-983, (2009)
[55] Li, Y.; Wang, M.; Wang, H., Accurate in silico identification of species-specific acetylation sites by integrating protein sequence-derived and functional features, Sci. Rep., 4, (2014)
[56] Liu, B., BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches, Brief. Bioinform., (2017)
[57] Liu, B.; Liu, F.; Wang, X.; Chen, J.; Fang, L., Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences, Nucleic Acids Res., 43, W65-W71, (2015)
[58] Liu, B.; Weng, F.; Huang, D. S., iRO-3wPseKNC: identify DNA replication origins by three-window-based PseKNC, Bioinformatics, 34, 3086-3093, (2018)
[59] Liu, B.; Wu, H., Pse-in-One 2.0: an improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences, Nat. Sci., 9, 67-91, (2017)
[60] Liu, B.; Zhang, D.; Xu, R.; Xu, J.; Wang, X.; Chen, Q.; Dong, Q., Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection, Bioinformatics, 30, 472-479, (2014)
[61] Lu, C. T.; Lee, T. Y.; Chen, Y. J., An Intelligent system for identifying acetylated lysine on histones and nonhistone proteins, Biomed. Res. Int., 2014, 2014, (2015)
[62] Min, J. L.; Xiao, X., iEzy-Drug: a web server for identifying the interaction between enzymes and drugs in cellular networking, Biomed. Res. Int., 2013, 1-13, (2013)
[63] Nagpal, G., Computer-aided designing of immunosuppressive peptides based on IL-10 inducing potential, Sci. Rep., 7, 42851, (2017)
[64] Qiao, N.; Xiaosa, Z.; Lingling, B., Detecting Succinylation sites from protein sequences using ensemble support vector machine, BMC Bioinform., 19, 1, 237, (2018)
[65] Qiu, W. R.; Xiao, X.; Lin, W. Z., iUbiq-Lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model, J. Biomol. Struct. Dynam., 33, 8, 1731-1742, (2015)
[66] Qiu, W. R.; Jiang, S. Y.; Sun, B. Q.; Xiao, X.; Cheng, X., iRNA-2methyl: identify RNA 2′-O-methylation sites by incorporating sequence-coupled effects into general PseKNC and ensemble classifier, Med. Chem., 13, 734-743, (2017)
[67] Qiu, W. R.; Sun, B. Q.; Xiao, X.; Xu, Z. C., iPTM-mLys: identifying multiple lysine PTM sites and their different types, Bioinformatics, 32, 3116-3123, (2016)
[68] Qiu, W. R.; Xiao, X.; Lin, W. Z., iMethyl-PseAAC: identification of protein methylation sites via a pseudo amino acid composition approach, Biomed. Res. Int., 2014, (2014)
[69] Yadav, S.; Gupta, M.; Bist, A. S., Prediction of ubiquitination sites using UbiNets, Adv. Fuzzy Syst., (2018)
[70] Shannon, C., The mathematical theory of communication. 1963, M.D. Comput.: Comput. Med. Pract., 14, 4, 306-317, (1997)
[71] Sheng-Bao, S.; Jian-Ding, Q.; Shao-Ping, S., Position-specific analysis and prediction for protein lysine acetylation based on multiple features, PLoS One, 7, 11, e49108, (2012)
[72] Shi, S. P.; Qiu, J. D.; Sun, X. Y., PMeS: prediction of methylation sites based on enhanced feature encoding scheme, PLoS One, 7, e38772, (2012)
[73] Teng, S.; Luo, H.; Wang, L., Predicting protein sumoylation sites from sequence features, Amino Acids, 43, 1, 447-455, (2012)
[74] Vens, C., Identifying discriminative classification-based motifs in biological sequences, Bioinformatics, 27, 1231-1238, (2011)
[75] Wang, J. R.; Huang, W. L.; Tsai, M. J., ESA-UbiSite: accurate prediction of human ubiquitination sites by identifying a set of effective negatives, Bioinformatics, 33, 5, 661, (2017)
[76] Wei, Z. S.; Yang, J. Y.; Shen, H. B., A cascade random forests algorithm for predicting protein-protein interaction sites, IEEE Trans. Nanobiosci., 14, 7, 746-760, (2015)
[77] Wen, P. P.; Shi, S. P.; Xu, H. D., Accurate in silico prediction of species-specific methylation sites based on information gain feature optimization, Bioinformatics, 32, 3107-3115, (2016)
[78] Wiśniewski, J. R.; Zougman, A.; Mann, M., Nε-Formylation of lysine is a widespread post-translational modification of nuclear proteins occurring at residues involved in regulation of chromatin function, Nucleic Acids Res., 36, 2, 570-577, (2008)
[79] Xiao, X.; Cheng, X.; Chen, G.; Mao, Q., pLoc_bal-mGpos: predict subcellular localization of Gram-positive bacterial proteins by quasi-balancing training dataset and PseAAC, Genomics, (2018) · Zbl 1406.92173
[80] Xiao, X.; Min, J. L.; Wang, P., iCDI-PseFpt: identify the channel-drug interaction in cellular networking with PseAAC and molecular fingerprints, J. Theor. Biol., 337C, 71-79, (2013)
[81] Xiao, X.; Ye, H. X.; Liu, Z.; Jia, J. H., iROS-gPseKNC: predicting replication origin sites in DNA by incorporating dinucleotide position-specific propensity into general pseudo nucleotide composition, Oncotarget, 7, 34180-34189, (2016)
[82] Xie, H. L.; Fu, L.; Nie, X. D., Using ensemble SVM to identify human GPCRs N-linked glycosylation sites based on the general form of Chou’s PseAAC, Protein Eng. Des. Sel., 26, 735-742, (2013)
[83] Xu, H.; Zhou, J.; Lin, S., PLMD: an updated data resource of protein lysine modifications, J. Genet. Genom., 44, 5, 243-250, (2017)
[84] Xu, Y.; Ding, Y. X.; Ding, J., iSuc-PseAAC: predicting lysine succinylation in proteins by incorporating peptide position-specific propensity, Sci. Rep., 5, 10184, (2015); Yavuz, A. S.; Sezerman, O. U., Predicting sumoylation sites using support vector machines based on various sequence features, conformational flexibility and disorder, BMC Genom., 15, suppl 9, S18, (2014)
[85] Xu, Y.; Wang, X. B.; Ding, J., Lysine acetylation sites prediction using an ensemble of support vector machine classifiers, J. Theor. Biol., 264, 1, 130-135, (2010) · Zbl 1406.92223
[86] Xu, Y.; Ding, J.; Wu, L. Y., iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition, PLoS One, 8, e55844, (2013)
[87] Xu, Y.; Shao, X. J.; Wu, L. Y.; Deng, N. Y., iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins, PeerJ., 1, e171, (2013)
[88] Xu, Y.; Wen, X.; Shao, X. J.; Deng, N. Y., iHyd-PseAAC: predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition, Int. J. Mol. Sci., 15, 7594-7610, (2014)
[89] Xu, Y.; Wen, X.; Wen, L. S.; Wu, L. Y.; Deng, N. Y., iNitro-Tyr: prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition, PLoS One, 9, (2014)
[90] Zhang, J.; Zhao, X.; Sun, P.; Ma, Z., PSNO: predicting cysteine S-nitrosylation sites by incorporating various sequence-derived features into the general form of Chou’s PseAAC, Int. J. Mol. Sci., 15, 11204-11219, (2014)
[91] Zhou, G. P., The disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein-protein interaction mechanism, J. Theor. Biol., 284, 142-148, (2011) · Zbl 1397.92245
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.