An instance-based learning recommendation algorithm of imbalance handling methods. (English) Zbl 1428.68268

Summary: Imbalance learning is a common problem in machine learning and data mining. To address it, researchers have proposed many state-of-the-art techniques, such as over-sampling, under-sampling, SMOTE, and cost-sensitive learning. However, the most appropriate method differs from one learning problem to another. Given an imbalance learning problem, we propose an instance-based learning (IBL) recommendation algorithm that presents the most appropriate imbalance handling method for it. First, a meta-knowledge database is built from the binary relation (data characteristic measures, rank of all candidate imbalance handling methods) of each data set. Then, when a new data set arrives, its characteristics are extracted and compared with the examples in the knowledge database, and the instance-based \(k\)-nearest neighbors algorithm is applied to identify the rank of all candidate imbalance handling methods for the new data set. Finally, the most appropriate imbalance handling method is derived by combining the recommended rank and individual bias. Experimental results on 80 public binary imbalanced data sets confirm that the proposed recommendation algorithm can effectively present the most appropriate imbalance handling method for a given imbalance learning problem, with a recommendation hit rate of up to 95%.
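The recommendation step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not name the data characteristic measures, the distance function, the rank aggregation rule, or how the individual bias is combined. The sketch therefore assumes meta-features already scaled to [0, 1], Euclidean distance, and mean-rank aggregation over the \(k\) nearest data sets, and it leaves out the individual-bias step entirely. The name `recommend_ranking`, the layout of `META_DB`, and all numbers in it are hypothetical.

```python
import math

# Hypothetical meta-knowledge database: one entry per known data set, pairing
# its (already scaled) characteristic measures with the rank of each candidate
# imbalance handling method on that data set (1 = best). Values are made up.
META_DB = [
    ((0.10, 0.20), {"SMOTE": 1, "UnderSampling": 2, "CostSensitive": 3}),
    ((0.15, 0.25), {"SMOTE": 1, "UnderSampling": 3, "CostSensitive": 2}),
    ((0.20, 0.10), {"SMOTE": 2, "UnderSampling": 1, "CostSensitive": 3}),
    ((0.90, 0.80), {"SMOTE": 3, "UnderSampling": 2, "CostSensitive": 1}),
]


def recommend_ranking(meta_db, new_features, k=3):
    """Return candidate methods ordered from most to least recommended.

    Assumption: neighbors are found by Euclidean distance and their ranks
    are combined by a plain average; the paper may use other choices.
    """
    # Pick the k stored data sets whose characteristics are closest.
    nearest = sorted(
        meta_db, key=lambda entry: math.dist(entry[0], new_features)
    )[:k]

    # Average each method's rank over those neighbors.
    methods = list(nearest[0][1])
    mean_rank = {
        m: sum(ranks[m] for _, ranks in nearest) / len(nearest)
        for m in methods
    }

    # Lower mean rank is better; ties are broken by method name.
    return sorted(methods, key=lambda m: (mean_rank[m], m))


if __name__ == "__main__":
    print(recommend_ranking(META_DB, (0.12, 0.22), k=3))
```

With these made-up entries, a new data set whose characteristics fall near the first three entries inherits their averaged ranking, while one near the last entry (with `k=1`) inherits that entry's ranking unchanged.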


68T05 Learning and adaptive systems in artificial intelligence
62G07 Density estimation

