VFC-SMOTE: very fast continuous synthetic minority oversampling for evolving data streams. (English) Zbl 1485.68212

Summary: The world is constantly changing, and so is the massive amount of data produced. However, only a few studies deal with online class imbalance learning, which combines the challenges of class-imbalanced data streams and concept drift. In this paper, we propose the very fast continuous synthetic minority oversampling technique (VFC-SMOTE). It is a novel meta-strategy to be prepended to any streaming machine learning classification algorithm; it oversamples the minority class using a new version of SMOTE and Borderline-SMOTE inspired by data sketching. We benchmarked VFC-SMOTE pipelines on synthetic and real data streams containing different concept drifts, imbalance levels, and class distributions. We provide statistical evidence that VFC-SMOTE pipelines learn models whose minority class performance is better than the state of the art. Moreover, we analyze the time and memory consumption and the concept drift recovery speed.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62D05 Sampling theory, sample surveys
62H30 Classification and discrimination; cluster analysis (statistical aspects)

Software:

UCI-ml; MOA; ADASYN; SMOTE

References:

[1] Abdi, L.; Hashemi, S., To combat multi-class imbalanced problems by means of over-sampling and boosting techniques, Soft Comput, 19, 12, 3369-3385 (2015) · doi:10.1007/s00500-014-1291-z
[2] Bellinger, C.; Sharma, S.; Japkowicz, N.; Zaïane, OR, Framework for extreme imbalance classification: SWIM - sampling with the majority class, Knowl Inf Syst, 62, 3, 841-866 (2020) · doi:10.1007/s10115-019-01380-z
[3] Bernardo A, Della Valle E, Bifet A (2020a) Incremental rebalancing learning on evolving data streams. In: Fatta GD, Sheng VS, Cuzzocrea A, Zaniolo C, Wu X (eds) 20th International Conference on Data Mining Workshops, ICDM Workshops 2020, Sorrento, Italy, November 17-20, 2020, IEEE, pp 844-850, doi:10.1109/ICDMW51313.2020.00121
[4] Bernardo A, Gomes HM, Montiel J, Pfahringer B, Bifet A, Della Valle E (2020b) C-SMOTE: continuous synthetic minority oversampling for evolving data streams. In: Wu X, Jermaine C, Xiong L, Hu X, Kotevska O, Lu S, Xu W, Aluru S, Zhai C, Al-Masri E, Chen Z, Saltz J (eds) IEEE International Conference on Big Data, Big Data 2020, Atlanta, GA, USA, December 10-13, 2020, IEEE, pp 483-492, doi:10.1109/BigData50022.2020.9377768
[5] Bifet A, Gavaldà R (2007) Learning from time-changing data with adaptive windowing. In: Proceedings of the Seventh SIAM International Conference on Data Mining, April 26-28, 2007, Minneapolis, Minnesota, USA, SIAM, pp 443-448, doi:10.1137/1.9781611972771.42
[6] Bifet A, Gavaldà R (2009) Adaptive learning from evolving data streams. In: Adams NM, Robardet C, Siebes A, Boulicaut J (eds) Advances in Intelligent Data Analysis VIII, 8th International Symposium on Intelligent Data Analysis, IDA 2009, Lyon, France, August 31 - September 2, 2009 Proceedings, Springer, Lecture Notes in Computer Science, vol 5772, pp 249-260, doi:10.1007/978-3-642-03915-7_22
[7] Bifet, A.; Holmes, G.; Kirkby, R.; Pfahringer, B., MOA: Massive online analysis, J Mach Learn Res, 11, 1601-1604 (2010)
[8] Bifet A, Pfahringer B, Read J, Holmes G (2013a) Efficient data stream classification via probabilistic adaptive windows. In: Shin SY, Maldonado JC (eds) Proceedings of the 28th Annual ACM Symposium on Applied Computing, SAC ’13, Coimbra, Portugal, March 18-22, 2013, ACM, pp 801-806, doi:10.1145/2480362.2480516
[9] Bifet A, Read J, Zliobaite I, Pfahringer B, Holmes G (2013b) Pitfalls in benchmarking data stream classification and how to avoid them. In: Blockeel H, Kersting K, Nijssen S, Zelezný F (eds) Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part I, Springer, Lecture Notes in Computer Science, vol 8188, pp 465-479, doi:10.1007/978-3-642-40988-2_30
[10] Bifet, A.; Gavaldà, R.; Holmes, G.; Pfahringer, B., Machine learning for data streams with practical examples in MOA (2018), Cambridge: MIT Press · doi:10.7551/mitpress/10654.001.0001
[11] Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C., DBSMOTE: density-based synthetic minority over-sampling technique, Appl Intell, 36, 3, 664-684 (2012) · doi:10.1007/s10489-011-0287-y
[12] Chawla, NV; Bowyer, KW; Hall, LO; Kegelmeyer, WP, SMOTE: Synthetic minority over-sampling technique, J Artif Intell Res, 16, 321-357 (2002) · Zbl 0994.68128 · doi:10.1613/jair.953
[13] Cormode, G., Data sketching, Commun ACM, 60, 9, 48-55 (2017) · doi:10.1145/3080008
[14] Demsar, J., Statistical comparisons of classifiers over multiple data sets, J Mach Learn Res, 7, 1-30 (2006) · Zbl 1222.68184
[15] Domingos PM, Hulten G (2000) Mining high-speed data streams. In: Ramakrishnan R, Stolfo SJ, Bayardo RJ, Parsa I (eds) Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, Boston, MA, USA, August 20-23, 2000, ACM, pp 71-80, doi:10.1145/347090.347107
[16] Douzas, G.; Bação, F., Geometric SMOTE a geometrically enhanced drop-in replacement for SMOTE, Inf Sci, 501, 118-135 (2019) · doi:10.1016/j.ins.2019.06.007
[17] Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml, Accessed 16 June 2021
[18] Fernández A, García S, Herrera F, Chawla NV (2018) SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J Artif Intell Res 61:863-905, doi:10.1613/jair.1.11192 · Zbl 1443.68147
[19] Ferreira LEB, Gomes HM, Bifet A, Oliveira LS (2019) Adaptive random forests with resampling for imbalanced data streams. In: International Joint Conference on Neural Networks, IJCNN 2019 Budapest, Hungary, July 14-19, 2019, IEEE, pp 1-6, doi:10.1109/IJCNN.2019.8852027
[20] Gama J, Medas P, Castillo G, Rodrigues PP (2004) Learning with drift detection. In: Bazzan ALC, Labidi S (eds) Advances in Artificial Intelligence - SBIA 2004, 17th Brazilian Symposium on Artificial Intelligence, São Luis, Maranhão, Brazil, September 29 - October 1, 2004, Proceedings, Springer, Lecture Notes in Computer Science, vol 3171, pp 286-295, doi:10.1007/978-3-540-28645-5_29 · Zbl 1105.68376
[21] Gama, J.; Sebastião, R.; Rodrigues, PP, On evaluating stream learning algorithms, Mach Learn, 90, 3, 317-346 (2013) · Zbl 1260.68329 · doi:10.1007/s10994-012-5320-9
[22] Gama J, Zliobaite I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv 46(4):44:1-44:37, doi:10.1145/2523813 · Zbl 1305.68141
[23] Ghazikhani, A.; Monsefi, R.; Yazdi, HS, Ensemble of online neural networks for non-stationary and imbalanced data streams, Neurocomputing, 122, 535-544 (2013) · doi:10.1016/j.neucom.2013.05.003
[24] Ghazikhani, A.; Monsefi, R.; Yazdi, HS, Recursive least square perceptron model for non-stationary and imbalanced data stream classification, Evol Syst, 4, 2, 119-131 (2013) · doi:10.1007/s12530-013-9076-7
[25] Ghazikhani, A.; Monsefi, R.; Yazdi, HS, Online neural network model for non-stationary and imbalanced data stream classification, Int J Mach Learn Cybern, 5, 1, 51-62 (2014) · doi:10.1007/s13042-013-0180-6
[26] Gomes, HM; Bifet, A.; Read, J.; Barddal, JP; Enembreck, F.; Pfahringer, B.; Holmes, G.; Abdessalem, T., Correction to: adaptive random forests for evolving data stream classification, Mach Learn, 108, 10, 1877-1878 (2019) · Zbl 07097465 · doi:10.1007/s10994-019-05793-3
[27] Grulich PM, Saitenmacher R, Traub J, Breß S, Rabl T, Markl V (2018) Scalable detection of concept drifts on data streams with parallel adaptive windowing. In: Böhlen MH, Pichler R, May N, Rahm E, Wu S, Hose K (eds) Proceedings of the 21st International Conference on Extending Database Technology, EDBT 2018, Vienna, Austria, March 26-29, 2018, OpenProceedings.org, pp 477-480, doi:10.5441/002/edbt.2018.51
[28] Han H, Wang W, Mao B (2005) Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In: Huang D, Zhang XS, Huang G (eds) Advances in Intelligent Computing, International Conference on Intelligent Computing, ICIC 2005, Hefei, China, August 23-26, 2005, Proceedings, Part I, Springer, Lecture Notes in Computer Science, vol 3644, pp 878-887, doi:10.1007/11538059_91
[29] Harries M (1999) SPLICE-2 comparative evaluation: Electricity pricing
[30] He, H.; Garcia, EA, Learning from imbalanced data, IEEE Trans Knowl Data Eng, 21, 9, 1263-1284 (2009) · doi:10.1109/TKDE.2008.239
[31] He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the International Joint Conference on Neural Networks, IJCNN 2008, part of the IEEE World Congress on Computational Intelligence, WCCI 2008, Hong Kong, China, June 1-6, 2008, IEEE, pp 1322-1328, doi:10.1109/IJCNN.2008.4633969
[32] Hulten G, Spencer L, Domingos PM (2001) Mining time-changing data streams. In: Lee D, Schkolnick M, Provost FJ, Srikant R (eds) Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, San Francisco, CA, USA, August 26-29, 2001, ACM, pp 97-106, doi:10.1145/502512.502529
[33] John GH, Langley P (2013) Estimating continuous distributions in bayesian classifiers. vol abs/1302.4964, arXiv:1302.4964
[34] Kranen, P.; Assent, I.; Baldauf, C.; Seidl, T., The ClusTree: indexing micro-clusters for anytime stream mining, Knowl Inf Syst, 29, 2, 249-272 (2011) · doi:10.1007/s10115-010-0342-8
[35] Li, H.; Shan, M.; Lee, S., DSM-FI: an efficient algorithm for mining frequent itemsets in data streams, Knowl Inf Syst, 17, 1, 79-97 (2008) · doi:10.1007/s10115-007-0112-4
[36] Linhart C, Harari G, Abramovich S, Buchris A (2009) PAKDD data mining competition 2009: New ways of using known methods. In: Theeramunkong T, Nattee C, Adeodato PJL, Chawla NV, Christen P, Lenca P, Poon J, Williams GJ (eds) New Frontiers in Applied Data Mining, PAKDD 2009 International Workshops, Bangkok, Thailand, April 27-30, 2009. Revised Selected Papers, Springer, Lecture Notes in Computer Science, vol 5669, pp 99-105, doi:10.1007/978-3-642-14640-4_7
[37] Lu J, Liu A, Dong F, Gu F, Gama J, Zhang G (2020) Learning under concept drift: A review. CoRR abs/2004.05785, arXiv:2004.05785
[38] Ma S, Li X, Ding Y, Orlowska ME (2007) A recommender system with interest-drifting. In: Benatallah B, Casati F, Georgakopoulos D, Bartolini C, Sadiq W, Godart C (eds) Web Information Systems Engineering - WISE 2007, 8th International Conference on Web Information Systems Engineering, Nancy, France, December 3-7, 2007, Proceedings, Springer, Lecture Notes in Computer Science, vol 4831, pp 633-642, doi:10.1007/978-3-540-76993-4_55
[39] Mirza, B.; Lin, Z.; Liu, N., Ensemble of subset online sequential extreme learning machine for class imbalance and concept drift, Neurocomputing, 149, 316-329 (2015) · doi:10.1016/j.neucom.2014.03.075
[40] Oza NC (2005) Online bagging and boosting. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Waikoloa, Hawaii, USA, October 10-12, 2005, IEEE, pp 2340-2345, doi:10.1109/ICSMC.2005.1571498
[41] Pozzolo, AD; Boracchi, G.; Caelen, O.; Alippi, C.; Bontempi, G., Credit card fraud detection: A realistic modeling and a novel learning strategy, IEEE Trans Neural Networks Learn Syst, 29, 8, 3784-3797 (2018) · doi:10.1109/TNNLS.2017.2736643
[42] Street WN, Kim Y (2001) A streaming ensemble algorithm (SEA) for large-scale classification. In: Lee D, Schkolnick M, Provost FJ, Srikant R (eds) Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, San Francisco, CA, USA, 2001, ACM, pp 377-382, doi:10.1145/502512.502568
[43] Tsymbal, A., The problem of concept drift: definitions and related work, Comput Sci Dep Trinity College Dublin, 106, 2, 58 (2004)
[44] Wang, B.; Pineau, J., Online bagging and boosting for imbalanced data streams, IEEE Trans Knowl Data Eng, 28, 12, 3353-3366 (2016) · doi:10.1109/TKDE.2016.2609424
[45] Wang S, Minku LL, Yao X (2013) A learning framework for online class imbalance learning. In: Proceedings of the IEEE Symposium on Computational Intelligence and Ensemble Learning, CIEL 2013, IEEE Symposium Series on Computational Intelligence (SSCI), 16-19 April 2013, Singapore, IEEE, pp 36-45, doi:10.1109/CIEL.2013.6613138
[46] Wang, S.; Minku, LL; Yao, X., Resampling-based ensemble methods for online class imbalance learning, IEEE Trans Knowl Data Eng, 27, 5, 1356-1368 (2015) · doi:10.1109/TKDE.2014.2345380
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.