Identifying N\(^6\)-methyladenosine sites using extreme gradient boosting system optimized by particle swarm optimizer. (English) Zbl 1409.92184

Summary: N\(^6\)-methyladenosine (m\(^6\)A) is one of the most important RNA modifications, playing roles in processes ranging from splicing, mRNA export and stability to cell differentiation. Because m\(^6\)A is widely distributed in genes, identifying m\(^6\)A sites in RNA sequences is important for basic biomedical research and drug development. High-throughput laboratory methods are time-consuming and costly, so effective computational methods are highly desirable for their convenience and speed. In this article, we propose a new method that improves m\(^6\)A prediction by combining deep features with original features and classifying them with extreme gradient boosting optimized by particle swarm optimization (PXGB). The proposed PXGB algorithm uses three kinds of features: position-specific nucleotide propensity (PSNP), position-specific dinucleotide propensity (PSDP), and the traditional nucleotide composition (NC). Under 10-fold cross-validation, PXGB achieved an AUC of 0.8390 and an MCC of 0.5234. PXGB was also compared with existing methods, and its higher MCC and AUC indicate that it is effective at predicting m\(^6\)A sites. The predictor proposed in this study might help to predict more m\(^6\)A sites and guide related experimental validation.
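As a rough illustration of two of the feature types and one of the metrics named in the summary, the following minimal Python sketch computes nucleotide composition (NC), a position-specific nucleotide propensity (PSNP) table, and the Matthews correlation coefficient (MCC). It is not the authors' implementation. It assumes the definition of PSNP commonly used in the m\(^6\)A literature (per-position nucleotide frequency in positive samples minus that in negative samples), and all function names and the toy sequences are hypothetical. PSDP would follow the same pattern with dinucleotides in place of single nucleotides.

```python
from collections import Counter
from math import sqrt

ALPHABET = "ACGU"  # RNA nucleotides


def nucleotide_composition(seq):
    """NC feature: fraction of each nucleotide in the sequence (4 values)."""
    counts = Counter(seq)
    return [counts[n] / len(seq) for n in ALPHABET]


def position_frequencies(seqs):
    """Per-position nucleotide frequencies over equal-length sequences."""
    length = len(seqs[0])
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.append({n: counts[n] / len(seqs) for n in ALPHABET})
    return freqs


def psnp_table(positives, negatives):
    """Assumed PSNP definition: positive minus negative frequency per position."""
    pos = position_frequencies(positives)
    neg = position_frequencies(negatives)
    return [{n: p[n] - q[n] for n in ALPHABET} for p, q in zip(pos, neg)]


def psnp_features(seq, table):
    """Encode a sequence by looking up its nucleotide at each position."""
    return [table[i][n] for i, n in enumerate(seq)]


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data (hypothetical, for illustration only).
positives = ["GGACU", "AGACA"]
negatives = ["CCAUU", "UUAGG"]
table = psnp_table(positives, negatives)
features = nucleotide_composition("GGACU") + psnp_features("GGACU", table)
```

In a pipeline of the kind the summary describes, vectors such as `features` would be passed to a gradient boosting classifier whose hyperparameters are tuned by a particle swarm search; that step is omitted here because the summary does not specify the search space.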


92D20 Protein sequences, DNA sequences
90C59 Approximation methods and heuristics in mathematical programming

