Using synthetic data and dimensionality reduction in high-dimensional classification via logistic regression. (English) Zbl 1449.62144

Summary: Traditional logistic regression suffers from degenerate and erratic behavior in high-dimensional classification, because the matrix involved in estimating the model parameters becomes non-invertible. In this paper, to overcome the high dimensionality of the data, we introduce two new algorithms. First, we improve the efficiency of the finite population Bayesian bootstrap logistic regression classifier by using a majority-vote rule. Second, by using simple random sampling without replacement to select a number of covariates smaller than the sample size and then applying traditional logistic regression, we obtain a second new algorithm for high-dimensional binary classification. We compare the proposed algorithms with regularized logistic regression models and with two other classification algorithms, namely naive Bayes and \(K\)-nearest neighbors, using both simulated and real data.
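The second algorithm described above (random covariate subsampling followed by ordinary logistic regression, aggregated by majority vote) can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the fitting routine is plain gradient ascent standing in for maximum-likelihood estimation, and all function names, the number of ensemble members, and the subspace size are hypothetical choices.

```python
import numpy as np

def fit_logistic(X, y, iters=200, lr=0.1):
    # Stand-in ML fit: plain gradient ascent on the log-likelihood
    # of a logistic regression model with an intercept term.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (y - p) / len(y)
    return w

def subspace_lr_ensemble(X, y, n_models=25, subspace_dim=None, seed=0):
    # Each member fits logistic regression on a simple random sample
    # (without replacement) of fewer covariates than observations, so
    # each member's design matrix has more rows than columns and the
    # usual invertibility problem is avoided.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = subspace_dim or max(1, min(p, n - 1) // 2)  # hypothetical default
    members = []
    for _ in range(n_models):
        cols = rng.choice(p, size=k, replace=False)
        members.append((cols, fit_logistic(X[:, cols], y)))
    return members

def predict(members, X):
    # Combine the members' 0/1 predictions by majority vote.
    votes = np.zeros(X.shape[0])
    for cols, w in members:
        Xb = np.hstack([np.ones((X.shape[0], 1)), X[:, cols]])
        votes += (Xb @ w > 0).astype(int)
    return (votes > len(members) / 2).astype(int)

# Toy high-dimensional setting: n = 40 samples, p = 200 covariates.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = subspace_lr_ensemble(X, y)
preds = predict(model, X)
acc = (preds == y).mean()
```

The majority-vote aggregation in `predict` is the same device the summary attributes to the first (finite population Bayesian bootstrap) algorithm; only the way the individual members are generated differs.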


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62J12 Generalized linear models (logistic models)
Full Text: Link


[1] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proc. Natl. Acad. Sci. USA, 96(12) (1999), 6745-6750.
[2] R. Bellman, Dynamic programming, Princeton University Press, 1957. · Zbl 0077.13605
[3] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research, 16 (2002), 321-357. · Zbl 0994.68128
[4] A. Christobel and Y. Sivaprakasam, An empirical comparison of data mining classification methods, International Journal of Computer Information Systems, 3(2) (2011), 24-28.
[5] R. D. Cook, Graphics for regression with a binary response, Journal of the American Statistical Association, 91(435) (1996), 983-992. · Zbl 0882.62060
[6] R. D. Cook and H. Lee, Dimension reduction in binary response regression, Journal of the American Statistical Association, 94(448) (1999), 1187-1200. · Zbl 1072.62619
[7] S. A. Czepiel, Maximum likelihood estimation of logistic regression models: theory and implementation, Available at czep.net/stat/mlelr.pdf (2002).
[8] S. Dudoit, J. Fridlyand, and T. P. Speed, Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, 97(457) (2002), 77-87. · Zbl 1073.62576
[9] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani, Pathwise coordinate optimization, Annals of Applied Statistics, 1(2) (2007), 302-332. · Zbl 1378.90064
[10] J. Friedman, T. Hastie, H. Höfling, and R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software, 33(1) (2010), 1-22.
[11] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286(5439) (1999), 531-537.
[12] D. Guan, W. Yuan, Y. K. Lee, K. Najeebullah, and M. K. Rasel, A review of ensemble learning based feature selection, IETE Technical Review, 31(3) (2014), 190-198.
[13] T. K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8) (1998), 832-844.
[14] A. E. Hoerl and R. W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics, 12(1) (1970), 55-67. · Zbl 0202.17205
[15] D. W. Hosmer and S. Lemeshow, Applied logistic regression, Wiley, New York, 2013. · Zbl 1276.62050
[16] K. Lee, H. Ahn, H. Moon, R. L. Kodell, and J. J. Chen, Multinomial logistic regression ensembles, Journal of Biopharmaceutical Statistics, 23(3) (2013), 681-694.
[17] K. C. Li, Sliced inverse regression for dimension reduction, Journal of the American Statistical Association, 86(414) (1991), 316-327. · Zbl 0742.62044
[18] Y. Liang, C. Liu, X. Z. Luan, K. S. Leung, T. M. Chan, Z. B. Xu, and H. Zhang, Sparse logistic regression with an \(L_{1/2}\) penalty for gene selection in cancer classification, BMC Bioinformatics, 14(1) (2013), 1-12.
[19] N. Lim, H. Ahn, H. Moon, and J. J. Chen, Classification of high-dimensional data with ensemble of logistic regression models, Journal of Biopharmaceutical Statistics, 20(1) (2010), 160-171.
[20] A. Y. Lo, A Bayesian bootstrap for a finite population, Annals of Statistics, 16 (1988), 1684-1695. · Zbl 0691.62005
[21] G. Meeden, R. Lazar, and C. J. Geyer, polyapost: Simulating from the Polya posterior, R package version 1.5, https://CRAN.R-project.org/package=polyapost (2017).
[22] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch, e1071: Misc functions of the Department of Statistics, Probability Theory Group (formerly: E1071), TU Wien, R package version 1.6-8, https://CRAN.R-project.org/package=e1071 (2017).
[23] S. Weisberg, Dimension reduction regression in R, Journal of Statistical Software, 7 (2002), 1-22.
[24] R. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B (Methodological), 58(1) (1996), 267-288. · Zbl 0850.62538
[25] S. Wang, X. Chen, J. Z. Huang, and S. Feng, Scalable subspace logistic regression models for high-dimensional data, APWeb 2012, LNCS 7235 (2012), 685-694.
[26] X. Zhang, Y. Fu, A. Zang, L. Sigal, and G. Agam, Learning classifiers from synthetic data using a multichannel autoencoder, arXiv:1503.03163 (2015).
[27] H. Zou and T. Hastie, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2) (2005), 301-320. · Zbl 1069.62054