Effective DNA binding protein prediction by using key features via Chou’s general PseAAC. (English) Zbl 1406.92447

Summary: DNA-binding proteins (DBPs) are responsible for several cellular functions, ranging from the immune system to the transport of oxygen. In recent studies, scientists have used supervised machine learning methods that classify DBPs using information from the protein sequence only. Most of these methods work effectively on the training sets, but their performance degrades on the independent test set. This leaves room for improving the prediction method by reducing over-fitting. In this paper, we extract several features solely from the protein sequence and carry out two different types of feature selection on them. Our results are comparable on the training set and significantly improved on the independent test set. On the independent test set our accuracy was 82.26%, an improvement of 1.62% over the previous best state-of-the-art methods. Sensitivity and area under the receiver operating characteristic curve on the independent test set were also higher, at 0.95 and 0.823, respectively.
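As an illustration only, the following sketch shows one simple sequence-only feature (amino acid composition) and how accuracy and sensitivity can be computed from binary predictions. The function names and the choice of feature are assumptions made for this example; the summary does not list the features, feature selection methods, or classifier actually used in the paper.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so feature vectors align.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the relative frequency of each standard residue (20 values).

    Hypothetical helper: an example of a feature derived from the
    sequence alone, not necessarily one used in the paper.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def accuracy_and_sensitivity(y_true, y_pred):
    """Return (accuracy, sensitivity), treating label 1 as DNA-binding."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Sensitivity is the fraction of true positives that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, sensitivity


if __name__ == "__main__":
    features = amino_acid_composition("MKKA")  # toy sequence
    print(features[AMINO_ACIDS.index("K")])
    print(accuracy_and_sensitivity([1, 1, 0, 0], [1, 0, 0, 0]))
```

Fixing the residue order keeps every protein's feature vector the same length regardless of sequence length, which is what makes such vectors usable as input to a feature selection step and a supervised classifier.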

MSC:

92D20 Protein sequences, DNA sequences
68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P10 Applications of statistics to biology and medical sciences; meta analysis
92-04 Software, source code, etc. for problems pertaining to biology
