Near-Bayesian support vector machines for imbalanced data classification with equal or unequal misclassification costs. (English) Zbl 1394.68280

Summary: Support Vector Machines (SVMs) form a family of popular classifier algorithms originally developed to solve two-class classification problems. However, SVMs are likely to perform poorly in situations with data imbalance between the classes, particularly when the target class is under-represented. This paper proposes a Near-Bayesian Support Vector Machine (NBSVM) for such imbalanced classification problems, combining the philosophies of decision boundary shift and unequal regularization costs. Based on certain assumptions which hold true for most real-world datasets, we use the fraction of the data belonging to each class to achieve both the boundary shift and the asymmetric regularization costs. The proposed approach is extended to the multi-class scenario and also adapted for cases with unequal misclassification costs for the different classes. Extensive comparison with the standard SVM and some state-of-the-art methods is provided as evidence of the ability of the proposed approach to perform competitively on imbalanced datasets. A modified Sequential Minimal Optimization (SMO) algorithm is also presented to solve the NBSVM optimization problem in a computationally efficient manner.
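
As a rough illustration of how class fractions can drive asymmetric regularization costs, the Python sketch below computes per-class fractions from a label list and derives costs inversely proportional to them. This is the common inverse-frequency heuristic, not the NBSVM formulation, whose exact expressions are not given in this record. The function names and the `base_cost` parameter are hypothetical, and the boundary shift is deliberately not sketched because the summary does not state its form.

```python
from collections import Counter


def class_fractions(labels):
    """Return the fraction of training samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def asymmetric_costs(labels, base_cost=1.0):
    """Per-class regularization costs inversely proportional to class fraction.

    Assumption: generic inverse-frequency weighting, scaled so that a
    perfectly balanced problem assigns every class exactly base_cost.
    It is not taken from the paper.
    """
    fractions = class_fractions(labels)
    n_classes = len(fractions)
    return {cls: base_cost / (n_classes * f) for cls, f in fractions.items()}


# Toy imbalanced problem: 10 target (+1) samples against 90 others (-1).
labels = [+1] * 10 + [-1] * 90
fractions = class_fractions(labels)
costs = asymmetric_costs(labels)
print(fractions)
print(costs)
```

The under-represented class receives the larger cost, so margin violations on it are penalized more heavily. With LIBSVM, such values could plausibly be supplied as per-class weights through the `-wi` option of `svm-train`, which scales the parameter C for class i.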

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)

Software:

UCI-ml; LIBSVM; SMOTE

References:

[1] Abe, S., Analysis of support vector machines, (Proceedings of the 2002 12th IEEE workshop on neural networks for signal processing, (2002), IEEE), 89-98
[2] Akbani, R.; Kwek, S.; Japkowicz, N., Applying support vector machines to imbalanced datasets, (Machine learning: ECML 2004, Lecture notes in computer science, Vol. 3201, (2004), Springer), 39-50 · Zbl 1132.68523
[3] Amari, S.-i.; Wu, S., Improving support vector machine classifiers by modifying kernel functions, Neural Networks, 12, 783-789, (1999)
[4] Bao-Liang, L.; Xiao-Lin, W.; Yang, Y.; Hai, Z., Learning from imbalanced datasets with a MIN-MAX modular support vector machine, Frontiers of Electrical and Electronic Engineering in China, 6, 56-71, (2011)
[5] Batuwita, R.; Palade, V., Efficient resampling methods for training support vector machines with imbalanced datasets, (The 2010 international joint conference on neural networks (IJCNN), (2010), IEEE), 1-8
[6] Batuwita, R.; Palade, V., Class imbalance learning methods for support vector machines, (Imbalanced learning: foundations, algorithms, and applications, (2013), John Wiley & Sons, Inc.), 83-99
[7] Brazdil, P., & Gama, J. (1991). Statlog repository. URL: http://www.liacc.up.pt/ML/statlog/datasets.html [2007-10-22].
[8] Burges, C. J., A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery, 2, 121-167, (1998)
[9] Cao, P.; Zhao, D.; Zaiane, O., An optimized cost-sensitive SVM for imbalanced data learning, (Advances in knowledge discovery and data mining, Lecture notes in computer science, Vol. 7819, (2013), Springer), 280-292
[10] Cervantes, J.; Li, X.; Yu, W., Imbalanced data classification via support vector machines and genetic algorithms, Connection Science, 26, 335-348, (2014)
[11] Chang, C.-C.; Lin, C.-J., LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST), 2, 27:1-27:27, (2011)
[12] Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P., SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16, 321-357, (2002) · Zbl 0994.68128
[13] Choi, J.M. (2010). A selective sampling method for imbalanced data learning on support vector machines (Ph.D. thesis). Ames, IA, USA.
[14] Cortes, C.; Vapnik, V., Support-vector networks, Machine Learning, 20, 273-297, (1995) · Zbl 0831.68098
[15] Duan, W.; Jing, L.; Lu, X. Y., Imbalanced data classification using cost-sensitive support vector machine based on information entropy, Advanced Materials Research, 989, 1756-1761, (2014)
[16] Ertekin, S., Learning in extreme conditions: online and active learning with massive, imbalanced and noisy data, (2009), University Park PA, USA, (Ph.D. thesis)
[17] Fernández, A.; García, S.; Herrera, F., Addressing the classification with imbalanced data: open problems and new challenges on class distribution, (Hybrid artificial intelligent systems, Lecture notes in computer science, Vol. 6678, (2011), Springer), 1-10
[18] Galar, M.; Fernandez, A.; Barrenechea, E.; Bustince, H.; Herrera, F., A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics Part C: Applications and Reviews, 42, 463-484, (2012)
[19] García, S.; Fernández, A.; Luengo, J.; Herrera, F., Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Information Sciences, 180, 2044-2064, (2010), Special Issue on Intelligent Distributed Information Systems
[20] Gonen, M.; Tanugur, A. G.; Alpaydin, E., Multiclass posterior probability support vector machines, IEEE Transactions on Neural Networks, 19, 130-139, (2008)
[21] He, H.; Garcia, E. A., Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering, 21, 1263-1284, (2009)
[22] Imam, T.; Ting, K. M.; Kamruzzaman, J., Z-SVM: an SVM for improved classification of imbalanced data, (AI 2006: advances in artificial intelligence, Lecture notes in computer science, Vol. 4304, (2006), Springer), 264-273
[23] Keerthi, S. S.; Shevade, S. K.; Bhattacharyya, C.; Murthy, K. R. K., Improvements to Platt’s SMO algorithm for SVM classifier design, Neural Computation, 13, 637-649, (2001) · Zbl 1085.68629
[24] Koknar-Tezel, S.; Latecki, L. J., Improving SVM classification on imbalanced data sets in distance spaces, (Ninth IEEE international conference on data mining, 2009. ICDM’09, (2009), IEEE), 259-267
[25] Kubat, M.; Matwin, S., Addressing the curse of imbalanced training sets: one-sided selection, (Proceedings of the fourteenth international conference on machine learning, (1997), Morgan Kaufmann Nashville, USA), 179-186
[26] Lee, J.; Wu, Y.; Kim, H., Unbalanced data classification using support vector machines with active learning on scleroderma lung disease patterns, Journal of Applied Statistics, 42, 676-689, (2014)
[27] Lichman, M. (2013). UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
[28] Lin, Y.; Lee, Y.; Wahba, G., Support vector machines for classification in nonstandard situations, Machine Learning, 46, 191-202, (2002) · Zbl 0998.68103
[29] Maratea, A.; Petrosino, A.; Manzo, M., Adjusted F-measure and kernel scaling for imbalanced data learning, Information Sciences, 257, 331-341, (2014)
[30] Masnadi-Shirazi, H.; Vasconcelos, N., Risk minimization, probability elicitation, and cost-sensitive SVMs, (Proceedings of the 27th international conference on machine learning, (2010), Omnipress), 759-766
[31] Peng, L.; Ting-ting, B.; Yang, L., SVM classification for high-dimensional imbalanced data based on SNR and under-sampling, International Journal of Multimedia and Ubiquitous Engineering, 10, 105-112, (2015)
[32] Peng, L.; Xiao-yang, Y.; Ting-ting, B.; Jiu-ling, H., Imbalanced data SVM classification method based on cluster boundary sampling and DT-KNN pruning, International Journal of Signal Processing, Image Processing and Pattern Recognition, 7, 61-68, (2014)
[33] Phoungphol, P., A classification framework for imbalanced data, (2013), (Ph.D. thesis)
[34] Platt, J., Fast training of support vector machines using sequential minimal optimization, (Advances in kernel methods: support vector learning, Vol. 3, (1999))
[35] Raskutti, B.; Kowalczyk, A., Extreme re-balancing for SVMs: A case study, SIGKDD Explorations Newsletter, 6, 60-69, (2004)
[36] Rätsch, G. (2001). IDA benchmark repository. URL: http://ida.first.fhg.de/projects/bench/benchmarks.htm.
[37] Stecking, R.; Schebesch, K. B., Classification of large imbalanced credit client data with cluster based SVM, (Challenges at the interface of data analysis, computer science, and optimization, Studies in classification, data analysis, and knowledge organization, (2012), Springer), 443-451
[38] Tang, Y.; Zhang, Y.-Q.; Chawla, N. V.; Krasser, S., SVMs modeling for highly imbalanced classification, IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 39, 281-288, (2009)
[39] Tao, Q.; Wu, G.-W.; Wang, F.-Y.; Wang, J., Posterior probability support vector machines for unbalanced data, IEEE Transactions on Neural Networks, 16, 1561-1573, (2005)
[40] Veropoulos, K., Campbell, C., & Cristianini, N. et al. (1999). Controlling the sensitivity of support vector machines. In Proceedings of the international joint conference on AI, IJCAI (pp. 55-60).
[41] Wang, Q., A hybrid sampling SVM approach to imbalanced data classification, (Abstract and applied analysis, (2014), Hindawi Publishing Corporation)
[42] Wang, B. X.; Japkowicz, N., Boosting support vector machines for imbalanced datasets, Knowledge and Information Systems, 25, 1-20, (2010)
[43] Wang, S.; Li, Z.; Chao, W.; Cao, Q., Applying adaptive over-sampling technique based on data density and cost-sensitive SVM to imbalanced learning, (The 2012 international joint conference on neural networks (IJCNN), (2012), IEEE), 1-8
[44] Wilcoxon, F., Individual comparisons by ranking methods, (Breakthroughs in statistics, Springer series in statistics, (1992), Springer), 196-202
[45] Wu, G., & Chang, E.Y. (2003). Adaptive feature-space conformal transformation for imbalanced-data learning. In Proceedings of the twentieth international conference on machine learning (pp. 816-823).
[46] Wu, G.; Chang, E. Y., KBA: kernel boundary alignment considering imbalanced data distribution, IEEE Transactions on Knowledge and Data Engineering, 17, 786-795, (2005)
[47] Wu, S.-H.; Lin, K.-P.; Chien, H.-H.; Chen, C.-M.; Chen, M.-S., On generalizable low false-positive learning using asymmetric support vector machines, IEEE Transactions on Knowledge and Data Engineering, 25, 1083-1096, (2013)
[48] Yang, C.-Y.; Yang, J.-S.; Wang, J.-J., Margin calibration in SVM class-imbalanced learning, Neurocomputing, 73, 397-411, (2009)
[49] Yu, J.; Cheng, F.; Xiong, H.; Qu, W.; Chen, X.-w., A Bayesian approach to support vector machines for the binary classification, Neurocomputing, 72, 177-185, (2008)
[50] Zhang, Y.; Fu, P.; Liu, W.; Chen, G., Imbalanced data classification based on scaling kernel-based support vector machine, Neural Computing and Applications, 25, 927-935, (2014)
[51] Zhao, Z.; Zhong, P.; Zhao, Y., Learning SVM with weighted maximum margin criterion for classification of imbalanced data, Mathematical and Computer Modelling, 54, 1093-1099, (2011) · Zbl 1227.68098
[52] Zughrat, A., Mahfouf, M., Yang, Y., & Thornton, S. (2014). Support vector machines for class imbalance rail data classification with bootstrapping-based over-sampling and under-sampling. In 19th world congress of the international federation of automatic control (pp. 8756-8761). Cape Town, South Africa, http://dx.doi.org/10.3182/20140824-6-ZA-1003.00794.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or perfect matching.