The effect of imbalanced data sets on LDA: a theoretical and empirical analysis. (English) Zbl 1118.68129

Summary: This paper shows theoretically that imbalanced data sets degrade the performance of linear discriminant analysis (LDA). Experiments confirm the analysis: when several sampling methods are used to rebalance the imbalanced data sets, LDA performs better on the rebalanced data than on the original imbalanced data.
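
The record does not say which sampling methods were used or how the theoretical argument runs, so the following is only an illustrative sketch under stated assumptions. It uses plain random oversampling (one common rebalancing method, chosen here for illustration) and a one-dimensional, two-class LDA rule with pooled variance and empirical class priors. The function names `random_oversample` and `lda_threshold_1d` and the toy data are hypothetical. Under these assumptions, the prior term moves the decision threshold toward the minority class, and rebalancing removes that shift.

```python
import math
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class
    until every class has as many samples as the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


def lda_threshold_1d(samples, labels):
    """Return (threshold, midpoint) for two-class 1-D LDA.

    Assumes labels are 0 and 1, a shared (pooled) variance,
    priors estimated from class frequencies, and distinct class means.
    """
    groups = {0: [], 1: []}
    for x, y in zip(samples, labels):
        groups[y].append(x)
    n0, n1 = len(groups[0]), len(groups[1])
    mu0 = sum(groups[0]) / n0
    mu1 = sum(groups[1]) / n1
    ss = sum((x - mu0) ** 2 for x in groups[0])
    ss += sum((x - mu1) ** 2 for x in groups[1])
    pooled_var = ss / (n0 + n1 - 2)
    midpoint = (mu0 + mu1) / 2
    # The log-prior ratio moves the boundary away from the midpoint.
    shift = pooled_var / (mu1 - mu0) * math.log(n0 / n1)
    return midpoint + shift, midpoint


# Toy data (hypothetical): 8 majority samples (class 0), 2 minority (class 1).
x = [-1.0, -0.5, 0.0, 0.5, 1.0, -0.8, 0.8, 0.2, 2.0, 3.0]
y = [0] * 8 + [1] * 2

print("imbalanced (threshold, midpoint):", lda_threshold_1d(x, y))
x_bal, y_bal = random_oversample(x, y, seed=1)
print("rebalanced (threshold, midpoint):", lda_threshold_1d(x_bal, y_bal))
```

On the imbalanced toy data the threshold lies beyond the midpoint of the class means, on the minority side, so fewer points are assigned to the minority class. After oversampling the class counts are equal, the log-prior term is zero, and the threshold coincides with the midpoint.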

MSC:

68T05 Learning and adaptive systems in artificial intelligence

Software:

SMOTE; UCI-ml

References:

[1] Jain, A. K.; Duin, R. P. W.; Mao, J., Statistical pattern recognition: a review, IEEE Trans. Pattern Anal. Mach. Intell., 22, 1, 4-37 (2000)
[2] Chawla, N. V.; Japkowicz, N.; Kolcz, A., Special issue on learning from imbalanced data sets, ACM SIGKDD Explorations, 6, 1 (2004)
[3] N.V. Chawla, N. Japkowicz, A. Kolcz (Eds.), Proceedings of the ICML’2003 Workshop on Learning from Imbalanced Data Sets, 2003.
[4] N. Japkowicz (Ed.), Proceedings of the AAAI’2000 Workshop on Learning from Imbalanced Data Sets, AAAI Technical Report WS-00-05, AAAI, 2000.
[5] McLachlan, G. J., Discriminant Analysis and Statistical Pattern Recognition (1992), Wiley: Wiley New York
[6] Bradley, A. P., The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition, 30, 7, 1145-1159 (1997)
[7] Hanley, J. A.; McNeil, B. J., The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology, 143, 29-36 (1982)
[8] C.L. Blake, C.J. Merz, UCI repository of machine learning database, http://www.ics.uci.edu/~mlearn/MLRepository.html
[9] Tomek, I., Two modifications of CNN, IEEE Trans. Syst. Man Cybern., SMC-6, 769-772 (1976) · Zbl 0341.68066
[10] Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P., SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res., 16, 321-357 (2002) · Zbl 0994.68128