
An improved algorithm for SVMs classification of imbalanced data sets. (English) Zbl 1186.68358

Palmer-Brown, Dominic (ed.) et al., Engineering applications of neural networks. 11th international conference, EANN 2009, London, UK, August 27–29, 2009. Proceedings. Berlin: Springer (ISBN 978-3-642-03968-3/pbk; 978-3-642-03969-0/ebook). Communications in Computer and Information Science 43, 108-118 (2009).
Summary: Support Vector Machines (SVMs) have strong theoretical foundations and excellent empirical success in many pattern recognition and data mining applications. However, when induced by imbalanced training sets, where the examples of the target class (minority) are outnumbered by the examples of the non-target class (majority), the performance of the SVM classifier deteriorates. In medical diagnosis and text classification, for instance, small and heavily imbalanced data sets are common. In this paper, we propose the Boundary Elimination and Domination algorithm (BED) to enhance SVM class-prediction accuracy on applications with imbalanced class distributions. BED is an informative resampling strategy in input space. In order to balance the class distributions, our algorithm uses density information in the training set to remove noisy examples of the majority class and to generate new synthetic examples of the minority class. In our experiments, we compared BED with the original SVM and with the Synthetic Minority Oversampling Technique (SMOTE), a popular resampling strategy in the literature. Our results demonstrate that this new approach improves SVM classifier performance on several real-world imbalanced problems.
For the entire collection see [Zbl 1181.68012].
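The paper itself gives BED's details; as a rough, hypothetical sketch of the two ingredients the summary describes — SMOTE-style synthetic minority generation and density-based removal of noisy majority examples — one could write the following (an illustration only, not the authors' BED implementation; all function names and the majority-vote cleaning criterion are invented here):

```python
import math
import random

def _k_nearest(point, pool, k):
    """Indices of the k nearest points in `pool` (Euclidean distance)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]

def smote_oversample(minority, n_new, k=3, rng=None):
    """SMOTE-style oversampling: each synthetic example is a random point
    on the segment between a minority example and one of its k nearest
    minority-class neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        others = minority[:i] + minority[i + 1:]
        nn = _k_nearest(minority[i], others, min(k, len(others)))
        neighbour = others[rng.choice(nn)]
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + t * (b - a)
                          for a, b in zip(minority[i], neighbour)])
    return synthetic

def clean_majority(majority, minority, k=3):
    """Density-based cleaning (hypothetical criterion): drop majority
    examples whose k nearest neighbours in the full training set are
    mostly minority -- they likely sit inside the minority region and
    act as label noise near the decision boundary."""
    pool = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    kept = []
    for x in majority:
        rest = [(p, lab) for (p, lab) in pool if p is not x]
        rest.sort(key=lambda pl: math.dist(x, pl[0]))
        minority_votes = sum(lab for _, lab in rest[:k])
        if minority_votes <= k // 2:  # keep only majority-dominated points
            kept.append(x)
    return kept
```

After cleaning the majority class and oversampling the minority class, the rebalanced set would be passed to a standard SVM trainer; the actual BED algorithm's density criterion may differ from this majority-vote sketch.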

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68T10 Pattern recognition, speech recognition
68W05 Nonnumerical algorithms

Software:

UCI-ml; SMOTE

References:

[1] Boser, B.E., Guyon, I.M., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory, pp. 144–152. ACM Press, New York (1992) · doi:10.1145/130385.130401
[2] Vapnik, V.N.: The nature of statistical learning theory. Springer, New York (1995) · Zbl 0833.62008 · doi:10.1007/978-1-4757-2440-0
[3] Cortes, C., Vapnik, V.: Support-Vector Networks. Mach. Learn. 20, 273–297 (1995) · Zbl 0831.68098
[4] Cristianini, N., Shawe-Taylor, J.: An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, London (2000) · Zbl 0994.68074 · doi:10.1017/CBO9780511801389
[5] Wu, G., Chang, E.Y.: KBA: kernel boundary alignment considering imbalanced data distribution. IEEE Trans. Knowl. Data Eng. 17, 786–795 (2005) · Zbl 05109801 · doi:10.1109/TKDE.2005.95
[6] Provost, F., Fawcett, T.: Robust classification for imprecise environments. Mach. Learn. 42, 203–231 (2001) · Zbl 0969.68126 · doi:10.1023/A:1007601015854
[7] Sun, Y., Kamel, M.S., Wong, A.K.C., Wang, Y.: Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40, 3358–3378 (2007) · Zbl 1122.68505 · doi:10.1016/j.patcog.2007.04.009
[8] Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences, http://www.ics.uci.edu/mlearn/MLRepository.html
[9] Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002) · Zbl 0994.68128
[10] Tan, P., Steinbach, M.: Introduction to Data Mining. Addison Wesley, Reading (2006)
[11] Kubat, M., Matwin, S.: Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of 14th International Conference on Machine Learning, pp. 179–186. Morgan Kaufmann, San Francisco (1997)
[12] Egan, J.P.: Signal detection theory and ROC analysis. Academic Press, London (1975)
[13] Weiss, G.M.: Mining with rarity: a unifying framework. SIGKDD Explor. Newsl. 6, 7–19 (2004) · Zbl 05442966 · doi:10.1145/1007730.1007734
[14] Karakoulas, G., Shawe-Taylor, J.: Optimizing classifiers for imbalanced training sets. In: Proceedings of Conference on Advances in Neural Information Processing Systems II, pp. 253–259. MIT Press, Cambridge (1999)
[15] Li, Y., Shawe-Taylor, J.: The SVM with uneven margins and Chinese document categorization. In: Proceedings of the 17th Pacific Asia Conference on Language, Information and Computation, pp. 216–227 (2003)
[16] Veropoulos, K., Campbell, C., Cristianini, N.: Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 55–60 (1999)
[17] Joachims, T.: Learning to classify text using support vector machines: methods, theory and algorithms. Kluwer Academic Publishers, Norwell (2002) · doi:10.1007/978-1-4615-0907-3
[18] Cristianini, N., Shawe-Taylor, J., Kandola, J.: On kernel target alignment. In: Proceedings of the Neural Information Processing Systems NIPS 2001, pp. 367–373. MIT Press, Cambridge (2002)
[19] Kandola, J., Shawe-Taylor, J.: Refining kernels for regression and uneven classification problems. In: Proceedings of International Conference on Artificial Intelligence and Statistics. Springer, Heidelberg (2003)
[20] Akbani, R., Kwek, S., Japkowicz, N.: Applying support vector machines to imbalanced datasets. In: Proceedings of European Conference on Machine Learning, pp. 39–50 (2004) · Zbl 1132.68523 · doi:10.1007/978-3-540-30115-8_7
[21] Vilariño, F., Spyridonos, P., Vitrià, J., Radeva, P.: Experiments with SVM and stratified sampling with an imbalanced problem: detection of intestinal contractions. In: Proceedings of International Workshop on Pattern Recognition for Crime Prevention, Security and Surveillance, pp. 783–791 (2005) · doi:10.1007/11552499_86
[22] Tang, Y., Zhang, Y.Q., Chawla, N.V., Krasser, S.: SVMs modeling for highly imbalanced classification. IEEE Trans. Syst., Man, Cybern. B 39, 281–288 (2009) · doi:10.1109/TSMCB.2008.2002909
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.