Sparse logistic principal components analysis for binary data. (English) Zbl 1202.62084

Summary: We develop a new principal components analysis (PCA) type dimension reduction method for binary data. Unlike standard PCA, which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced into the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components. Our sparse PCA is formulated as an optimization problem whose criterion function is motivated by a penalized Bernoulli likelihood. A majorization-minimization algorithm is developed to solve the optimization problem efficiently. The effectiveness of the proposed sparse logistic PCA method is illustrated by a simulation study and by an application to a single nucleotide polymorphism data set.
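The ideas in the summary can be illustrated with a minimal rank-one sketch. It is not the authors' algorithm as published, only one plausible instance under stated assumptions: the logit of the success probability is modeled as theta_ij = mu_j + a_i * b_j, the criterion is the negative Bernoulli log-likelihood plus an L1 penalty lam * sum_j |b_j| on the loadings, and the majorizer is the uniform quadratic bound that follows from the logistic loss having curvature at most 1/4. The function names (`sparse_logistic_pca_rank1`, `soft`, `objective`) and the parameter `lam` are hypothetical and chosen for illustration.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def soft(c, t):
    # Soft-thresholding: closed-form minimizer of 0.5*(b - c)^2 + t*|b|.
    return math.copysign(max(abs(c) - t, 0.0), c)


def objective(X, mu, a, b, lam):
    # Negative Bernoulli log-likelihood plus an L1 penalty on the loadings b
    # (an assumed form of the penalized criterion).
    nll = 0.0
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            t = (2 * x - 1) * (mu[j] + a[i] * b[j])
            nll += max(-t, 0.0) + math.log1p(math.exp(-abs(t)))
    return nll + lam * sum(abs(v) for v in b)


def sparse_logistic_pca_rank1(X, lam, n_iter=50, seed=0):
    n, d = len(X), len(X[0])
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in a))
    a = [v / norm for v in a]          # scores, kept at unit norm
    mu, b = [0.0] * d, [0.0] * d       # column offsets and sparse loadings
    history = [objective(X, mu, a, b, lam)]
    for _ in range(n_iter):
        # Majorization: working responses from the uniform quadratic bound,
        # so the surrogate is (1/8) * sum (theta_ij - Z_ij)^2 + penalty.
        Z = [[0.0] * d for _ in range(n)]
        for i in range(n):
            for j in range(d):
                q = 2 * X[i][j] - 1
                theta = mu[j] + a[i] * b[j]
                Z[i][j] = theta + 4.0 * q * sigmoid(-q * theta)
        # Minimization, block 1: column offsets.
        mu = [sum(Z[i][j] - a[i] * b[j] for i in range(n)) / n
              for j in range(d)]
        Zc = [[Z[i][j] - mu[j] for j in range(d)] for i in range(n)]
        # Block 2: unit-norm scores (left unchanged while b is all zero).
        s = [sum(Zc[i][j] * b[j] for j in range(d)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in s))
        if norm > 0.0:
            a = [v / norm for v in s]
        # Block 3: sparse loadings by soft-thresholding at 4 * lam.
        b = [soft(sum(a[i] * Zc[i][j] for i in range(n)), 4.0 * lam)
             for j in range(d)]
        history.append(objective(X, mu, a, b, lam))
    return mu, a, b, history
```

Because each block update can only lower the quadratic surrogate, and the surrogate touches the criterion at the current iterate, the values in `history` should be non-increasing. Larger `lam` drives more entries of `b` exactly to zero, which is the sparsity the summary refers to. Extending the sketch to several components would require an orthogonality constraint on the scores, which is omitted here.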


62H25 Factor analysis and principal components; correspondence analysis
65C60 Computational problems in statistics (MSC2010)
90C90 Applications of mathematical programming

