A cost-sensitive ensemble method for class-imbalanced datasets. (English) Zbl 1272.68347

Summary: In imbalanced learning, resampling methods modify an imbalanced dataset to form a balanced one, and many base classifiers perform better on balanced datasets than on imbalanced ones. This paper proposes a cost-sensitive ensemble method, based on cost-sensitive support vector machines (SVMs) and query-by-committee (QBC), for imbalanced data classification. The proposed method first divides the majority-class dataset into several subdatasets according to the imbalance ratio of the samples and trains subclassifiers using the AdaBoost method. It then generates candidate training samples with the QBC active learning method and learns from them using cost-sensitive SVMs. Experiments on five class-imbalanced datasets show that the proposed method achieves a higher area under the ROC curve (AUC), F-measure, and G-mean than many existing class-imbalanced learning methods.
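Two steps of the pipeline described above — splitting the majority class in proportion to the imbalance ratio, and scoring unlabeled candidates by committee disagreement (QBC) — can be sketched as follows. This is a minimal illustration, not the authors' implementation; the helper names `partition_majority`, `vote_entropy`, and `select_candidates` are hypothetical.

```python
import math
import random

def partition_majority(majority, minority, seed=0):
    """Hypothetical helper: split the majority class into roughly
    |majority| / |minority| subdatasets, each about minority-class size,
    so each subclassifier trains on a balanced subset."""
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    k = max(1, round(len(majority) / len(minority)))
    # Deal samples round-robin into k subsets of near-equal size.
    return [shuffled[i::k] for i in range(k)]

def vote_entropy(votes):
    """QBC disagreement measure: entropy of the committee's label votes.
    Zero when the committee agrees; maximal when votes are split evenly."""
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    n = len(votes)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def select_candidates(pool, committee, top=10):
    """Keep the pool samples the committee disagrees on most; these are
    the candidate training samples passed to the cost-sensitive SVM."""
    scored = [(vote_entropy([clf(x) for clf in committee]), x) for x in pool]
    scored.sort(key=lambda t: -t[0])
    return [x for _, x in scored[:top]]
```

In a full implementation the committee members would be the AdaBoost-trained subclassifiers, and the selected candidates would be labeled and fed to an SVM with asymmetric misclassification costs (e.g. a larger penalty on minority-class errors, as in [20]).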


68T05 Learning and adaptive systems in artificial intelligence
Full Text: DOI


[1] Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P., SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 321-357 (2002) · Zbl 0994.68128
[2] Gao, M.; Hong, X.; Chen, S.; Harris, C. J., A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems, Neurocomputing, 74, 3456-3466 (2011)
[3] Weiss, G., Mining with rarity: a unifying framework, SIGKDD Explorations, 6, 1, 7-19 (2004)
[4] Seiffert, C.; Khoshgoftaar, T. M.; van Hulse, J.; Napolitano, A., RUSBoost: a hybrid approach to alleviating class imbalance, IEEE Transactions on Systems, Man, and Cybernetics A, 40, 1, 185-197 (2010)
[5] Kim, M. S., An effective under-sampling method for class imbalance data problem, Proceedings of the 8th Symposium on Advanced Intelligent Systems
[6] Yen, S. J.; Lee, Y. S., Cluster-based under-sampling approaches for imbalanced data distributions, Expert Systems with Applications, 36, 3, 5718-5727 (2009)
[7] Liu, X. Y.; Wu, J. X.; Zhou, Z. H., Exploratory undersampling for class-imbalance learning, IEEE Transactions on Systems, Man, and Cybernetics B, 39, 2, 539-550 (2009)
[8] Drummond, C.; Holte, R. C., C4.5 decision tree, class imbalance, and cost sensitivity: why under-sampling beats over-sampling, Proceedings of the Workshop on Learning from Imbalanced Data Sets II, International Conference on Machine Learning
[9] Chawla, N. V.; Lazarevic, A.; Hall, L. O.; Bowyer, K. W., SMOTEBoost: improving prediction of the minority class in boosting, Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD ’03)
[10] Wang, S.; Li, Z.; Chao, W.; Cao, Q., Applying adaptive over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning, The International Joint Conference on Neural Networks (IJCNN ’12)
[11] Gao, M.; Hong, X.; Chen, S.; Harris, C. J., Probability density function estimation based over-sampling for imbalanced two-class problems, The International Joint Conference on Neural Networks (IJCNN ’12)
[12] Elkan, C., The foundations of cost-sensitive learning, Proceedings of the 17th International Joint Conference on Artificial Intelligence
[13] Wang, B. X.; Japkowicz, N., Boosting support vector machines for imbalanced data sets, Knowledge and Information Systems, 25, 1, 1-20 (2010)
[14] Sun, Y.; Kamel, M. S.; Wong, A. K. C.; Wang, Y., Cost-sensitive boosting for classification of imbalanced data, Pattern Recognition, 40, 12, 3358-3378 (2007) · Zbl 1122.68505
[15] Guo, H.; Viktor, H. L., Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach, SIGKDD Explorations, 6, 1, 30-39 (2004)
[16] Akbani, R.; Kwek, S.; Japkowicz, N., Applying support vector machines to imbalanced datasets, Proceedings of the 15th European Conference on Machine Learning (ECML ’04)
[17] Tang, Y.; Zhang, Y. Q.; Chawla, N. V., SVMs modeling for highly imbalanced classification, IEEE Transactions on Systems, Man, and Cybernetics B, 39, 1, 281-288 (2009)
[18] Wang, J.; You, J.; Li, Q.; Xu, Y., Extract minimum positive and maximum negative features for imbalanced binary classification, Pattern Recognition, 45, 1136-1145 (2012)
[19] García-Pedrajas, N.; Pérez-Rodríguez, J.; de Haro-García, A., OligoIS: scalable instance selection for class-imbalanced data sets, IEEE Transactions on Systems, Man, and Cybernetics B (2012)
[20] Veropoulos, K.; Campbell, C.; Cristianini, N., Controlling the sensitivity of support vector machines, Proceedings of the International Joint Conference on Artificial Intelligence
[21] Seung, H. S.; Opper, M.; Sompolinsky, H., Query by committee, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory
[22] Freund, Y.; Seung, H. S.; Shamir, E.; Tishby, N., Selective sampling using the query by committee algorithm, Machine Learning, 28, 2-3, 133-168 (1997) · Zbl 0881.68093
[23] Freund, Y.; Schapire, R. E., A decision-theoretic generalization of on-line learning and an application to boosting, Proceedings of the 2nd European Conference on Computational Learning Theory
[24] Joshi, M. V.; Kumar, V.; Agarwal, R. C., Evaluating boosting algorithms to classify rare classes: comparison and improvements, Proceedings of the 1st IEEE International Conference on Data Mining (ICDM ’01)
[25] Fawcett, T., ROC graphs: notes and practical considerations for researchers, HPL-2003-4 (2003), Palo Alto, Calif, USA: HP Labs, Palo Alto, Calif, USA
[26] Lewis, D.; Gale, W., Training text classifiers by uncertainty sampling, Proceedings of the 7th Annual International ACM SIGIR Conference on Research and Development in Information
[27] Frank, A.; Asuncion, A., UCI Machine Learning Repository (2010), Irvine, Calif, USA: University of California, School of Information and Computer Science, Irvine, Calif, USA