Imbalanced classification in sparse and large behaviour datasets. (English) Zbl 1411.68124

Summary: Recent years have witnessed a growing number of publications dealing with the imbalanced learning issue. While a plethora of techniques has been investigated on traditional low-dimensional data, little is known about their effect on behaviour data. This kind of data reflects fine-grained behaviours of individuals or organisations and is characterized by sparseness and very large dimensions. In this article, we investigate the effects of several over- and undersampling, cost-sensitive learning and boosting techniques on the problem of learning from imbalanced behaviour data. Oversampling techniques show a good overall performance and do not seem to suffer from the overfitting that traditional studies report. A variety of undersampling approaches is investigated as well and reveals the performance-degrading effect of instances exhibiting odd behaviour. Furthermore, the boosting process indicates that the regularization parameter in the SVM formulation acts as a weakness indicator and that a combination of weak learners can often achieve better generalization than a single strong learner. Finally, the EasyEnsemble technique is presented as the method that outperforms all others. By randomly sampling several balanced subsets, feeding them to a boosting process and subsequently combining their hypotheses, a classifier is obtained that achieves noise/outlier-reduction effects and simultaneously explores the majority-class space efficiently. Moreover, the method is very fast, since it is parallelizable and each subset is only twice the size of the minority class.
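The EasyEnsemble procedure described above (balanced random subsets, one boosted learner per subset, combined hypotheses) can be sketched as follows. This is a minimal illustrative implementation assuming scikit-learn and NumPy; the function names, subset count and base learner are our own choices, not the authors' exact experimental setup (which used SVM-based learners).

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def easy_ensemble_fit(X, y, n_subsets=5, random_state=0):
    """Train one boosted learner per balanced random subset (EasyEnsemble-style)."""
    rng = np.random.default_rng(random_state)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    learners = []
    for _ in range(n_subsets):
        # Each subset pairs the full minority class with an equally sized
        # random draw from the majority class, so it is only twice the
        # minority-class size -- this is what makes the method cheap.
        sampled = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sampled])
        clf = AdaBoostClassifier(n_estimators=10, random_state=random_state)
        learners.append(clf.fit(X[idx], y[idx]))
    return learners

def easy_ensemble_predict(learners, X):
    # Combine the subset hypotheses by averaging their predicted
    # minority-class probabilities.
    probs = np.mean([c.predict_proba(X)[:, 1] for c in learners], axis=0)
    return (probs >= 0.5).astype(int)
```

Because the subsets are drawn independently, the training loop is trivially parallelizable, which is the source of the speed advantage noted in the summary.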


68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)
Full Text: DOI


[1] Akbani R, Kwek S, Japkowicz N (2004) Applying support vector machines to imbalanced datasets. In: Machine learning: ECML 2004: 15th European conference on machine learning, Pisa, Italy, September 20-24, 2004. Proceedings, Springer, Berlin, pp 39-50. doi:10.1007/978-3-540-30115-8_7 · Zbl 1132.68523
[2] Ali, A.; Shamsuddin, SM; Ralescu, AL, Classification with class imbalance problem: a review, Int J Adv Soft Comput Appl, 7, 176-204, (2015)
[3] Alzahrani T, Horadam KJ (2016) Community detection in bipartite networks: algorithms and case studies. In: Complex systems and networks: dynamics, controls and applications. Springer, Berlin, pp 25-50. doi:10.1007/978-3-662-47824-0_2
[4] Bachner J (2013) Predictive policing: preventing crime with data and analytics. IBM Center for the Business of Government
[5] Baesens, B.; Gestel, T.; Viaene, S.; Stepanova, M.; Suykens, J.; Vanthienen, J., Benchmarking state-of-the-art classification algorithms for credit scoring, J Oper Res Soc, 54, 627-635, (2003) · Zbl 1097.91516
[6] Barandela, R.; Sánchez, JS; García, V.; Rangel, E., Strategies for learning in class imbalance problems, Pattern Recognit, 36, 849-851, (2003)
[7] Barber, MJ, Modularity and community detection in bipartite networks, Phys Rev E, 76, 066102, (2007)
[8] Barua, S.; Islam, MM; Yao, X.; Murase, K., MWMOTE-majority weighted minority oversampling technique for imbalanced data set learning, IEEE Trans Knowl Data Eng, 26, 405-425, (2014)
[9] Batista, GEAPA; Prati, RC; Monard, MC, A study of the behavior of several methods for balancing machine learning training data, SIGKDD Explor Newsl, 6, 20-29, (2004)
[10] Beckett SJ (2016) Improved community detection in weighted bipartite networks. R Soc Open Sci 3(1). doi:10.1098/rsos.140536
[11] Bekkar, M.; Djemaa, HK; Alitouche, TA, Evaluation measures for models assessment over imbalanced data sets, J Inf Eng Appl, 3, 27-38, (2013)
[12] Bhattacharyya, S.; Jha, S.; Tharakunnel, K.; Westland, JC, Data mining for credit card fraud: a comparative study, Decis Support Syst, 50, 602-613, (2011)
[13] Blondel, VD; Guillaume, JL; Lambiotte, R.; Lefebvre, E., Fast unfolding of communities in large networks, J Stat Mech Theory Exp, 2008, P10008, (2008)
[14] Breiman L, Friedman J, Stone CJ, Olshen RA (1984) Classification and regression trees. Taylor & Francis, London
[15] Brozovsky L, Petricek V (2007) Recommender system for online dating service. In: Proceedings of Znalosti 2007 Conference, VSB, Ostrava
[16] Cha M, Mislove A, Gummadi KP (2009) A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th international conference on world wide web, ACM, New York. WWW ’09, pp 721-730. doi:10.1145/1526709.1526806
[17] Chawla, NV; Maimon, O. (ed.); Rokach, L. (ed.), Data mining for imbalanced datasets: an overview, 853-867, (2005), Boston
[18] Chawla, NV; Bowyer, KW; Hall, LO; Kegelmeyer, WP, SMOTE: synthetic minority over-sampling technique, J Artif Intell Res, 16, 321-357, (2002) · Zbl 0994.68128
[19] Chawla, NV; Lazarevic, A.; Hall, LO; Bowyer, KW; Lavrač, N. (ed.); Gamberger, D. (ed.); Todorovski, L. (ed.); Blockeel, H. (ed.), SMOTEBoost: improving prediction of the minority class in boosting, 107-119, (2003), Berlin
[20] Chawla, NV; Japkowicz, N.; Kotcz, A., Editorial: special issue on learning from imbalanced data sets, SIGKDD Explor Newsl, 6, 1-6, (2004)
[21] Chen, M.; Mao, S.; Liu, Y., Big data: a survey, Mob Netw Appl, 19, 171-209, (2014)
[22] Chyi YM (2003) Classification analysis techniques for skewed class distribution problems. Master's thesis, Department of Information Management, National Sun Yat-sen University
[23] Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(Jan):1-30
[24] Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, KDD ’01, pp 269-274. doi:10.1145/502512.502550
[25] Drummond C, Holte RC (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In: Proceedings of the ICML ’03 Workshop on learning from imbalanced datasets
[26] Fan, RE; Chang, KW; Hsieh, CJ; Wang, XR; Lin, CJ, LIBLINEAR: a library for large linear classification, J Mach Learn Res, 9, 1871-1874, (2008) · Zbl 1225.68175
[27] Fan W, Stolfo SJ, Zhang J, Chan PK (1999) AdaCost: Misclassification cost-sensitive boosting. In: Proceedings of the sixteenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, ICML ’99, pp 97-105
[28] Fawcett, T., An introduction to ROC analysis, Pattern Recognit Lett, 27, 861-874, (2006)
[29] Finch, H., Comparison of distance measures in cluster analysis with dichotomous data, J Data Sci, 3, 85-100, (2005)
[30] Fortunato, S., Community detection in graphs, Phys Rep, 486, 75-174, (2010)
[31] Frasca, M.; Bertoni, A.; Re, M.; Valentini, G., A neural network algorithm for semi-supervised node label learning from unbalanced data, Neural Netw, 43, 84-98, (2013) · Zbl 1293.68222
[32] Friedman, M., The use of ranks to avoid the assumption of normality implicit in the analysis of variance, J Am Stat Assoc, 32, 675-701, (1937) · JFM 63.1098.02
[33] García E, Lozano F (2007) Boosting support vector machines. In: Machine learning and data mining in pattern recognition, 5th international conference, MLDM 2007, Leipzig, Germany, July 18-20, Post Proceedings, IBaI Publishing, pp 153-167
[34] Goldstein, M.; Uchida, S., A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data, PLoS ONE, 11, 1-31, (2016)
[35] González, PC; Velásquez, JD, Characterization and detection of taxpayers with false invoices using data mining techniques, Expert Syst Appl, 40, 1427-1436, (2013)
[36] Guimerà, R.; Sales-Pardo, M.; Amaral, LAN, Module identification in bipartite and directed networks, Phys Rev E, 76, 036102, (2007)
[37] Guo, H.; Viktor, HL, Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach, SIGKDD Explor Newsl, 6, 30-39, (2004)
[38] Guo X, Yin Y, Dong C, Yang G, Zhou G (2008) On the class imbalance problem. In: 2008 fourth international conference on natural computation, IEEE, vol 4, pp 192-201. doi:10.1109/ICNC.2008.871
[39] Han, H.; Wang, WY; Mao, BH; Huang, D. (ed.); Zhang, X-P (ed.); Huang, G-B (ed.), Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, 878-887, (2005), Berlin
[40] He, H.; Garcia, EA, Learning from imbalanced data, IEEE Trans Knowl Data Eng, 21, 1263-1284, (2009)
[41] He H, Bai Y, Garcia EA, Li S (2008) ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence), IEEE, pp 1322-1328. doi:10.1109/IJCNN.2008.4633969
[42] Holm, S., A simple sequentially rejective multiple test procedure, Scand J Stat, 6, 65-70, (1979) · Zbl 0402.62058
[43] Hsu, CW; Lin, CJ, A comparison of methods for multiclass support vector machines, IEEE Trans Neural Netw, 13, 415-425, (2002)
[44] Huang A (2008) Similarity measures for text document clustering. In: Proceedings of the sixth New Zealand computer science research student conference (NZCSRSC2008). Christchurch, New Zealand, pp 49-56
[45] Iman, RL; Davenport, JM, Approximations of the critical region of the Friedman statistic, Commun Stat Theory Methods, 9, 571-595, (1980) · Zbl 0451.62061
[46] Jo, T.; Japkowicz, N., Class imbalances versus small disjuncts, ACM SIGKDD Explor Newsl, 6, 40-49, (2004)
[47] Junqué de Fortuny, E.; Martens, D.; Provost, F., Predictive modeling with big data: is bigger really better?, Big Data, 1, 215-226, (2014)
[48] Junqué de Fortuny E, Stankova M, Moeyersoms J, Minnaert B, Provost F, Martens D (2014b) Corporate residence fraud detection. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, KDD ’14, pp 1650-1659. doi:10.1145/2623330.2623333
[49] Jutla IS, Jeub LG, Mucha PJ (2011-2016) A generalized louvain method for community detection implemented in MATLAB. http://netwiki.amath.unc.edu/GenLouvain
[50] Kubat M, Matwin S (1997) Addressing the curse of imbalanced training sets: one-sided selection. In: Proceedings of the fourteenth international conference on machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 179-186
[51] Lancichinetti, A.; Fortunato, S., Community detection algorithms: a comparative analysis, Phys Rev E, 80, 056117, (2009)
[52] Larremore, DB; Clauset, A.; Jacobs, AZ, Efficiently inferring community structure in bipartite networks, Phys Rev E Stat Nonlinear Soft Matter Phys, 90, 012805, (2014)
[53] Li, J.; Fine, JP, Weighted area under the receiver operating characteristic curve and its application to gene selection, J R Stat Soc Series C (Appl Stat), 59, 673-692, (2010)
[54] Li, X.; Wang, L.; Sung, E., AdaBoost with SVM-based component classifiers, Eng Appl Artif Intell, 21, 785-795, (2008)
[55] Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml
[56] Liu, W.; Chawla, S.; Cieslak, DA; Chawla, NV, A robust decision tree algorithm for imbalanced data sets, Proc Tenth SIAM Int Conf Data Mining SIAM Phila, 10, 766-777, (2010)
[57] Liu, XY; Wu, J.; Zhou, ZH, Exploratory undersampling for class-imbalance learning, IEEE Trans Syst Man Cybern B (Cybern), 39, 539-550, (2009)
[58] Luts, J.; Ojeda, F.; Plas, R.; Moor, B.; Huffel, S.; Suykens, JA, A tutorial on support vector machine-based methods for classification problems in chemometrics, Anal Chim Acta, 665, 129-145, (2010)
[59] Macskassy, SA; Provost, F., Classification in networked data: a toolkit and a univariate case study, J Mach Learn Res, 8, 935-983, (2007)
[60] Martens, D.; Provost, F., Explaining data-driven document classifications, MIS Q, 38, 73-100, (2014)
[61] Martens, D.; Provost, F.; Clark, J.; Junqué de Fortuny, E., Mining massive fine-grained behavior data to improve predictive analytics, MIS Q, 40, 869-888, (2016)
[62] Mazurowski, MA; Habas, PA; Zurada, JM; Lo, JY; Baker, JA; Tourassi, GD, Training neural network classifiers for medical decision making: the effects of imbalanced datasets on classification performance, Neural Netw, 21, 427-436, (2008)
[63] Mease, D.; Wyner, AJ; Buja, A., Boosted classification trees and class probability/quantile estimation, J Mach Learn Res, 8, 409-439, (2007) · Zbl 1222.68261
[64] Nemenyi P (1963) Distribution-free multiple comparisons. Dissertation, Princeton University
[65] Newman, MEJ; Girvan, M., Finding and evaluating community structure in networks, Phys Rev E, 69, 026113, (2004)
[66] Ng AY (2004) Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the twenty-first international conference on machine learning, ACM, New York, NY, USA, ICML ’04, p 78. doi:10.1145/1015330.1015435
[67] Ng AY, Jordan MI (2002) On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems 14. MIT Press, pp 841-848
[68] Ngai, E.; Hu, Y.; Wong, Y.; Chen, Y.; Sun, X., The application of data mining techniques in financial fraud detection: a classification framework and an academic review of literature, Decis Support Syst, 50, 559-569, (2011)
[69] Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola AJ, Bartlett P, Schoelkopf B, Schuurmans D (eds) Advances in large-margin classifiers. MIT Press, pp 61-74
[70] Porter, MA; Onnela, JP; Mucha, PJ, Communities in networks, Not Am Math Soc, 56, 1082-1097, (2009) · Zbl 1188.05142
[71] Provost F, Fawcett T (2013) Data science for business: what you need to know about data mining and data-analytic thinking. O’Reilly Media, Inc
[72] Provost F, Dalessandro B, Hook R, Zhang X, Murray A (2009) Audience selection for on-line brand advertising: privacy-friendly social network targeting. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, NY, USA, KDD ’09, pp 707-716. doi:10.1145/1557019.1557098
[73] Raskutti, B.; Kowalczyk, A., Extreme re-balancing for SVMs: a case study, SIGKDD Explor Newsl, 6, 60-69, (2004)
[74] Rosvall, M.; Bergstrom, CT, Maps of random walks on complex networks reveal community structure, Proc Natl Acad Sci, 105, 1118-1123, (2008)
[75] Schapire RE (1999) A brief introduction to boosting. In: Proceedings of the 16th international joint conference on artificial intelligence—volume 2. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, IJCAI’99, pp 1401-1406
[76] Schapire, RE; Singer, Y., Improved boosting algorithms using confidence-rated predictions, Mach Learn, 37, 297-336, (1999) · Zbl 0945.68194
[77] Shmueli, G., Analyzing behavioral big data: methodological, practical, ethical, and moral issues, Qual Eng, 29, 57-74, (2017)
[78] Sobhani P, Viktor H, Matwin S (2015) Learning from imbalanced data using ensemble methods and cluster-based undersampling. In: New frontiers in mining complex patterns: third international workshop, NFMCP 2014, Held in Conjunction with ECML-PKDD 2014, Nancy, France, September 19, 2014, Revised Selected Papers, Springer International Publishing, Cham, pp 69-83. doi:10.1007/978-3-319-17876-9_5
[79] Stankova M (2016) Classification within network data with a bipartite structure. Dissertation, University of Antwerp
[80] Stankova M, Martens D, Provost F (2015) Classification over bipartite graphs through projection. Working papers 2015001, University of Antwerp, Faculty of Applied Economics
[81] Sun, Y.; Kamel, MS; Wong, AK; Wang, Y., Cost-sensitive boosting for classification of imbalanced data, Pattern Recognit, 40, 3358-3378, (2007) · Zbl 1122.68505
[82] Suykens JA, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J, Suykens J, Van Gestel T (2002) Least squares support vector machines. World Scientific, Singapore · Zbl 1017.93004
[83] Tang, B.; He, H., ENN: extended nearest neighbor method for pattern recognition [research frontier], IEEE Comput Intell Mag, 10, 52-60, (2015)
[84] Tang, Y.; Zhang, YQ; Chawla, NV; Krasser, S., SVMs modeling for highly imbalanced classification, IEEE Trans Syst Man Cybern B (Cybern), 39, 281-288, (2009)
[85] Tobback E, Moeyersoms J, Stankova M, Martens D (2016) Bankruptcy prediction for SMEs using relational data. Working paper 2016004, University of Antwerp, Faculty of Applied Economics
[86] Verbeke, W.; Dejaeger, K.; Martens, D.; Hur, J.; Baesens, B., New insights into churn prediction in the telecommunication sector: a profit driven data mining approach, Eur J Oper Res, 218, 211-229, (2012)
[87] Veropoulos K, Campbell I, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the international joint conference on artificial intelligence, Stockholm, Sweden (IJCAI99), pp 55-60
[88] Whitrow, C.; Hand, DJ; Juszczak, P.; Weston, D.; Adams, NM, Transaction aggregation as a strategy for credit card fraud detection, Data Min Knowl Discov, 18, 30-55, (2009)
[89] Wickramaratna J, Holden SB, Buxton BF (2001) Performance degradation in boosting. In: Proceedings of the second international workshop on multiple classifier systems, Springer, London, UK, MCS ’01, pp 11-21 · Zbl 0980.68780
[90] Yen, SJ; Lee, YS, Cluster-based under-sampling approaches for imbalanced data distributions, Expert Syst Appl, 36, 5718-5727, (2009)
[91] Yu HF, Lo HY, Hsieh HP, Lou JK, McKenzie TG, Chou JW, Chung PH, Ho CH, Chang CF, Wei YH, et al (2010) Feature engineering and classifier ensemble for KDD Cup 2010. In: Proceedings of the KDD Cup 2010 Workshop, pp 1-16
[92] Zha H, He X, Ding C, Simon H, Gu M (2001) Bipartite graph partitioning and data clustering. In: Proceedings of the tenth international conference on information and knowledge management, ACM, New York, NY, USA, CIKM ’01, pp 25-32. doi:10.1145/502585.502591
[93] Zhang J, Mani I (2003) kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of the ICML'2003 workshop on learning from imbalanced datasets, Washington DC
[94] Ziegler CN, McNee SM, Konstan JA, Lausen G (2005) Improving recommendation lists through topic diversification. In: Proceedings of the 14th international conference on world wide web, ACM, New York, NY, USA, WWW ’05, pp 22-32. doi:10.1145/1060745.1060754
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.