Local uncertainty sampling for large-scale multiclass logistic regression. (English) Zbl 1452.62163

When a data set is too large to be analyzed in full with the available computational resources, a common strategy for multiclass logistic regression is to subsample the data down to a size that the available computer resources can accommodate. Two types of class imbalance arise: marginal imbalance (MI), when some classes are much rarer than others, and conditional imbalance (CI), when the class labels are easy to predict for most of the observations. For MI in binary classification, case-control (CC) subsampling is commonly used, drawing an equal number of samples uniformly from each class.
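The case-control scheme described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper; the function name and the per-class cap are my own choices.

```python
import numpy as np

def case_control_subsample(y, m, rng=None):
    """Toy case-control subsampling: draw m points uniformly from each class.

    y : 1-D array of class labels.
    Returns the indices of the (balanced) subsample.
    """
    rng = np.random.default_rng(rng)
    idx = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        # sample without replacement, capped at the class size
        take = min(m, members.size)
        idx.append(rng.choice(members, size=take, replace=False))
    return np.concatenate(idx)

# Marginally imbalanced toy labels: 95 zeros, 5 ones
y = np.array([0] * 95 + [1] * 5)
sub = case_control_subsample(y, m=5, rng=0)
# The subsample contains 5 points from each class
```

Note that the resulting subsample is balanced regardless of the original class proportions, which is exactly what makes CC sampling effective under marginal imbalance.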
In this paper, the authors first review an earlier subsampling scheme for binary logistic regression, termed local case-control (LCC) sampling. This scheme is known to outperform uniform random sampling with respect to the asymptotic variance of the resulting estimates.
Next, they propose local uncertainty sampling (LUS), a general subsampling scheme for large-scale multiclass logistic regression. The method selects data points whose labels are conditionally uncertain given the local observations, as judged by a predicted probability distribution, and then fits a multiclass logistic model on the subsample to estimate the model parameters.
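A toy version of this idea can be sketched as follows. This is an illustrative simplification, not the authors' exact scheme: here the acceptance probability is simply one minus the pilot model's top predicted class probability, and kept points receive inverse-probability weights in the spirit of Horvitz-Thompson estimation.

```python
import numpy as np

def uncertainty_subsample(pilot_probs, rng=None):
    """Toy local-uncertainty-style subsampling (illustrative only).

    pilot_probs : (n, K) array of pilot-model class probabilities per point.
    Each point is kept with probability 1 - max_k p_k, so points whose
    label the pilot model already predicts confidently are rarely kept.
    Returns (kept_indices, weights), where weights are the inverse
    acceptance probabilities for reweighted model fitting.
    """
    rng = np.random.default_rng(rng)
    accept_prob = 1.0 - pilot_probs.max(axis=1)
    keep = rng.random(pilot_probs.shape[0]) < accept_prob
    idx = np.flatnonzero(keep)
    return idx, 1.0 / accept_prob[idx]

# Three points: one certain, one maximally uncertain, one in between
probs = np.array([[1.0, 0.0], [0.5, 0.5], [0.2, 0.8]])
idx, w = uncertainty_subsample(probs, rng=0)
```

The first point, whose label the pilot model predicts with certainty, is never selected; the maximally uncertain second point is the most likely to be kept.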
Simulated and real-world data sets, namely MNIST and Web-spam, are considered, and the experiments confirm that LUS outperforms uniform sampling, CC sampling and LCC sampling under various settings. Theoretically, if the maximum likelihood estimator based on the full sample of size \(n\) has asymptotic variance \(V\), then for any \(c>1\) the LUS estimator achieves asymptotic variance at most \(c\,V\) based on a subsample of expected size \(n/c\).


62D05 Sampling theory, sample surveys
62F10 Point estimation
62J12 Generalized linear models (logistic models)
Full Text: DOI arXiv Euclid


[1] Abe, N., Zadrozny, B. and Langford, J. (2004). An iterative method for multi-class cost-sensitive learning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 3-11.
[2] Anderson, J. A. (1972). Separate sample logistic discrimination. Biometrika 59 19-35. · Zbl 0231.62080
[3] Atkeson, C. G., Moore, A. W. and Schaal, S. (1997). Locally weighted learning for control. In Lazy Learning 75-113. Springer, Berlin.
[4] Breslow, N. (1982). Design and analysis of case-control studies. Annu. Rev. Public Health 3 29-54.
[5] Chawla, N. V., Japkowicz, N. and Kotcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets. ACM SIGKDD Explor. Newsl. 6 1-6.
[6] Cortes, C., Mansour, Y. and Mohri, M. (2010). Learning bounds for importance weighting. In Advances in Neural Information Processing Systems 442-450.
[7] Cortes, C., Mohri, M., Riley, M. and Rostamizadeh, A. (2008). Sample selection bias correction theory. In Algorithmic Learning Theory. Lecture Notes in Computer Science 5254 38-53. Springer, Berlin. · Zbl 1156.68524
[8] Dhillon, P., Lu, Y., Foster, D. P. and Ungar, L. (2013). New subsampling algorithms for fast least squares regression. In Advances in Neural Information Processing Systems 360-368.
[9] Fithian, W. and Hastie, T. (2014). Local case-control sampling: Efficient subsampling in imbalanced data sets. Ann. Statist. 42 1693-1724. · Zbl 1305.62096
[10] Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting. Ann. Statist. 28 337-407. · Zbl 1106.62323
[11] Han, L., Tan, K. M., Yang, T. and Zhang, T. (2020). Supplement to “Local uncertainty sampling for large-scale multiclass logistic regression.” https://doi.org/10.1214/19-AOS1867SUPP.
[12] He, H. and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21 1263-1284.
[13] Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47 663-685. · Zbl 0047.38301
[14] Kim, H.-C., Pang, S., Je, H.-M., Kim, D. and Bang, S. Y. (2002). Pattern classification using support vector machine ensemble. In Proceedings of the International Conference on Pattern Recognition 2 160-163.
[15] King, G. and Zeng, L. (2001). Logistic regression in rare events data. Polit. Anal. 9 137-163.
[16] LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proc. IEEE 86 2278-2324.
[17] Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies. J. Natl. Cancer Inst. 22 719-748.
[18] Mineiro, P. and Karampatziakis, N. (2013). Loss-proportional subsampling for subsequent ERM. Preprint. Available at arXiv:1306.1840.
[19] Scott, A. and Wild, C. (2002). On the robustness of weighted methods for fitting models to case-control data. J. R. Stat. Soc. Ser. B. Stat. Methodol. 64 207-219. · Zbl 1059.62010
[20] Scott, A. J. and Wild, C. J. (1986). Fitting logistic models under case-control or choice based sampling. J. Roy. Statist. Soc. Ser. B 48 170-182. · Zbl 0608.62084
[21] Scott, A. J. and Wild, C. J. (1991). Fitting logistic regression models in stratified case-control studies. Biometrics 47 497-510. · Zbl 0736.62093
[22] Tan, A. C., Gilbert, D. and Deville, Y. (2003). Multi-class protein fold classification using a new ensemble machine learning approach. Genome Inform. 14 206-217.
[23] Webb, S., Caverlee, J. and Pu, C. (2006). Introducing the Webb Spam Corpus: Using email spam to identify Web spam automatically. In Proceedings of the Third Conference on Email and Anti-Spam.
[24] Widodo, A. and Yang, B.-S. (2007). Support vector machine in machine condition monitoring and fault diagnosis. Mech. Syst. Signal Process. 21 2560-2574.
[25] Xie, Y. and Manski, C. F. (1989). The logit model and response-based samples. Sociol. Methods Res. 17 283-302.
[26] Zadrozny, B. (2004). Learning and evaluating classifiers under sample selection bias. In Proceedings of the International Conference on Machine Learning 114.
[27] Zhang, T.