A three-way decision ensemble method for imbalanced data oversampling. (English) Zbl 1456.68170

Summary: The Synthetic Minority Over-sampling Technique (SMOTE) is an effective method for imbalanced data classification, and many variants of it have been proposed in the past decade. These methods mainly focus on how to select the crucial minority samples, and they implicitly assume that the selection of key minority samples is binary. As a result, the cost of key sample selection is seldom considered. To this end, this paper proposes a three-way decision model (CTD) that considers the differences in the cost of selecting key samples. CTD first uses the Constructive Covering Algorithm (CCA) to divide the minority samples into several covers. Then, a three-way decision model for key sample selection is constructed according to the density of the covers on the minority samples. Finally, the corresponding thresholds \(\alpha\) and \(\beta\) of CTD are obtained based on the pattern of the cover distribution on the minority samples, after which key samples can be selected for SMOTE oversampling. Moreover, to overcome a shortcoming of CCA, namely that randomly selecting the cover centers may yield non-optimal covers, an ensemble model based on CTD (CTDE) is further proposed to improve the performance of CTD. Numerical experiments on 10 imbalanced datasets show that our method is superior to the comparison methods. Constructing an ensemble of three-way decision based key sample selection effectively improves the performance of the model compared with several state-of-the-art methods.
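The two generic building blocks named in the summary, a two-threshold (three-way) split and SMOTE-style interpolation, can be illustrated with a minimal Python sketch. This is not the authors' CTD or CTDE implementation: the cover construction by CCA, the density measure, and the rule for deriving \(\alpha\) and \(\beta\) are not specified in the summary and are not reproduced here. The function names, the score values, and the choice to treat the "accept" region as the key samples are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def three_way_partition(
    scores: Sequence[float], alpha: float, beta: float
) -> Tuple[List[int], List[int], List[int]]:
    """Split sample indices into accept / defer / reject regions.

    Assumed convention (hypothetical): score >= alpha is accepted,
    score <= beta is rejected, anything in between is deferred.
    """
    if not beta < alpha:
        raise ValueError("expected beta < alpha")
    accept, defer, reject = [], [], []
    for i, s in enumerate(scores):
        if s >= alpha:
            accept.append(i)
        elif s <= beta:
            reject.append(i)
        else:
            defer.append(i)
    return accept, defer, reject


def smote_interpolate(x: Point, neighbor: Point, rng: random.Random) -> Point:
    """Create one synthetic point on the segment between x and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbor))


def oversample(
    key_samples: Sequence[Point], n_new: int, k: int = 2, seed: int = 0
) -> List[Point]:
    """Generate n_new synthetic points from the selected key samples."""
    if len(key_samples) < 2:
        raise ValueError("need at least two key samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(key_samples))
        x = key_samples[i]
        # Rank the other key samples by squared Euclidean distance to x.
        others = [p for j, p in enumerate(key_samples) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbor = rng.choice(others[:k])  # one of the k nearest neighbors
        synthetic.append(smote_interpolate(x, neighbor, rng))
    return synthetic


# Hypothetical scores for four minority samples (not data from the paper).
scores = [0.9, 0.5, 0.1, 0.7]
accept, defer, reject = three_way_partition(scores, alpha=0.7, beta=0.3)
minority = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 0.0)]
key = [minority[i] for i in accept]
new_points = oversample(key, n_new=3, k=1)
```

With these illustrative values, samples 0 and 3 fall in the accept region, sample 1 is deferred, and sample 2 is rejected, so the synthetic points are interpolated only between the two accepted samples. How the deferred region should be handled is a design decision that the summary leaves to the full paper.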

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T37 Reasoning under uncertainty in the context of artificial intelligence
