VR-BFDT: a variance reduction based binary fuzzy decision tree induction method for protein function prediction. (English) Zbl 1341.92052

Summary: In the protein function prediction (PFP) problem, the goal is to predict the functions of the many well-sequenced proteins whose functions are still not precisely known. PFP is a particularly complex machine learning problem because a protein (regarded as an instance) may have more than one function simultaneously. Furthermore, the functions (regarded as classes) are interdependent and are organized in a hierarchy in the form of a tree or a directed acyclic graph. Decision trees are a common learning method for this problem, but because they partition the data into sets with sharp boundaries, a small change in the attribute values of a new instance may change its predicted label and cause misclassification. In this paper, a variance reduction based binary fuzzy decision tree (VR-BFDT) algorithm is proposed to predict protein functions. The algorithm fuzzifies only the decision boundaries instead of converting the numeric attributes into fuzzy linguistic terms. It can assign multiple functions to each protein simultaneously and preserves the hierarchy consistency between functional classes. It uses label variance reduction as the splitting criterion to select the best "attribute-value" pair at each node of the decision tree. The experimental results show that the overall performance of the proposed algorithm is promising.
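The summary does not give the formulas behind the splitting criterion or the fuzzified boundary, so the following Python sketch is only one plausible reading of the idea, not the authors' implementation. It assumes each protein's functions are encoded as a binary label vector, measures variance as the mean squared distance to the mean label vector, and softens a binary split "x <= threshold" with a linear ramp of half-width `width`. The names `weighted_variance`, `fuzzy_membership`, `variance_reduction`, `best_split` and the parameter `width` are all hypothetical.

```python
import numpy as np

def weighted_variance(Y, w):
    """Weighted mean squared distance of label vectors to their weighted mean."""
    total = w.sum()
    if total == 0:
        return 0.0
    mean = (w[:, None] * Y).sum(axis=0) / total
    sq_dist = ((Y - mean) ** 2).sum(axis=1)
    return float((w * sq_dist).sum() / total)

def fuzzy_membership(x, threshold, width):
    """Degree of membership in the left branch (x <= threshold).

    Assumed shape: a linear ramp from 1 to 0 over
    [threshold - width, threshold + width]; width == 0 gives a crisp split.
    """
    if width == 0:
        return (x <= threshold).astype(float)
    return np.clip((threshold + width - x) / (2 * width), 0.0, 1.0)

def variance_reduction(x, Y, threshold, width):
    """Parent label variance minus the membership-weighted child variances."""
    w_left = fuzzy_membership(x, threshold, width)
    w_right = 1.0 - w_left
    n = len(x)
    parent = weighted_variance(Y, np.ones(n))
    children = (w_left.sum() / n) * weighted_variance(Y, w_left) \
             + (w_right.sum() / n) * weighted_variance(Y, w_right)
    return parent - children

def best_split(x, Y, width):
    """Pick the candidate threshold (midpoint of adjacent values) with the largest reduction."""
    values = np.unique(x)
    if len(values) < 2:
        raise ValueError("need at least two distinct attribute values")
    candidates = (values[:-1] + values[1:]) / 2
    scores = [variance_reduction(x, Y, t, width) for t in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])

# Toy data: one numeric attribute, two functional classes per protein.
x = np.array([0.1, 0.2, 0.8, 0.9])
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
threshold, score = best_split(x, Y, width=0.1)
```

On this toy data the chosen threshold separates the two groups of label vectors. An instance lying within `width` of the threshold would belong partly to both branches instead of flipping abruptly from one to the other, which is the sensitivity of sharp boundaries that the summary describes.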


92D20 Protein sequences, DNA sequences
68T05 Learning and adaptive systems in artificial intelligence

