New hard-thresholding rules based on data splitting in high-dimensional imbalanced classification. (English) Zbl 1493.62392

Summary: In binary classification, imbalance refers to situations in which one class is heavily under-represented. It arises either from the data collection process or because one class is genuinely rare in the population. Imbalanced classification is common in applications in biology, medicine, engineering, and the social sciences. In this paper, for the first time, we theoretically study the impact of imbalanced class sizes on linear discriminant analysis (LDA) in high dimensions. We show that, owing to data scarcity in one class (referred to as the minority class) and the high dimensionality of the feature space, the LDA ignores the minority class, yielding a maximum misclassification rate. We then propose a new construction of hard-thresholding rules, based on a data-splitting technique, that reduces the large difference between the misclassification rates. We show that the proposed method is asymptotically optimal. We further study two well-known sparse versions of the LDA in imbalanced cases. We evaluate the finite-sample performance of the different methods using simulations and by analyzing two real data sets. The results show that our method either outperforms its competitors or achieves comparable performance with a much smaller subset of selected features, while being computationally more efficient.
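The effect described in the summary can be illustrated with a small simulation. The sketch below is not the authors' construction. It is a minimal illustration under stated assumptions: Gaussian features with identity covariance, a plain centroid (LDA-type) rule, and a generic split-then-threshold variant in which one half of the data selects features by two-sample t-statistics and the other half estimates the rule. The function names, the stratified half split, and the threshold sqrt(2 log p) are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate(n0, n1, p, signal, n_signal, rng):
    """Class 0 ~ N(0, I); class 1 is shifted by `signal` on the first n_signal features."""
    y = np.r_[np.zeros(n0, dtype=int), np.ones(n1, dtype=int)]
    X = rng.standard_normal((n0 + n1, p))
    X[y == 1, :n_signal] += signal
    return X, y

def centroid_rule(X, y, features):
    """LDA-type rule with identity covariance (an assumption) on a feature subset."""
    mu0 = X[y == 0][:, features].mean(axis=0)  # majority-class mean
    mu1 = X[y == 1][:, features].mean(axis=0)  # minority-class mean
    w, mid = mu0 - mu1, (mu0 + mu1) / 2.0
    # Predict label 1 (minority) when the discriminant score is negative.
    return lambda Z: ((Z[:, features] - mid) @ w < 0).astype(int)

def t_statistics(X, y):
    """Two-sample t-statistics with pooled variance, one per feature."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    pooled = ((n0 - 1) * X0.var(axis=0, ddof=1)
              + (n1 - 1) * X1.var(axis=0, ddof=1)) / (n0 + n1 - 2)
    return (X0.mean(axis=0) - X1.mean(axis=0)) / np.sqrt(pooled * (1 / n0 + 1 / n1))

def split_threshold_rule(X, y, rng):
    """Select features on one half of the data, fit the rule on the other half."""
    first = np.zeros(len(y), dtype=bool)
    for label in (0, 1):  # stratified split keeps minority points in both halves
        idx = rng.permutation(np.flatnonzero(y == label))
        first[idx[: len(idx) // 2]] = True
    t = t_statistics(X[first], y[first])
    threshold = np.sqrt(2 * np.log(X.shape[1]))  # hypothetical threshold choice
    selected = np.flatnonzero(np.abs(t) > threshold)
    return centroid_rule(X[~first], y[~first], selected), selected

rng = np.random.default_rng(0)
p = 4000
X, y = simulate(100, 10, p, signal=3.0, n_signal=10, rng=rng)            # imbalanced training set
X_test, y_test = simulate(500, 500, p, signal=3.0, n_signal=10, rng=rng)

plain = centroid_rule(X, y, np.arange(p))            # uses all p features
sparse, selected = split_threshold_rule(X, y, rng)   # uses thresholded features only

minority = y_test == 1
err_plain = np.mean(plain(X_test[minority]) != 1)
err_sparse = np.mean(sparse(X_test[minority]) != 1)
print(f"selected features: {len(selected)}")
print(f"minority error, all features: {err_plain:.3f}")
print(f"minority error, split + threshold: {err_sparse:.3f}")
```

In this setup the noise accumulated in the estimated minority mean over thousands of uninformative features biases the unthresholded rule toward the majority class, while the thresholded rule avoids most of that accumulation by using only a few features.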


62H30 Classification and discrimination; cluster analysis (statistical aspects)

