Sparse probit linear mixed model. (English) Zbl 1460.62111

Summary: Linear mixed models (LMMs) are important tools in statistical genetics. When used for feature selection, they can identify a sparse set of genetic traits that best predict a continuous phenotype of interest, while simultaneously correcting for confounding factors such as age, ethnicity and population structure. Because they are formulated as linear regression models, LMMs have been restricted to continuous phenotypes. We introduce the sparse probit linear mixed model (Probit-LMM), which generalizes the LMM modeling paradigm to binary phenotypes. The technical challenge is that the model no longer possesses a closed-form likelihood function. In this paper, we present a scalable approximate inference algorithm that lets us fit the model to high-dimensional data sets. We show on three real-world examples from different domains that, in the setting of binary labels, our algorithm achieves better prediction accuracy and also selects features that are less correlated with the confounding factors.
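To illustrate why the likelihood loses its closed form, here is a minimal sketch under assumptions that the summary does not spell out: binary labels are taken to be the sign of a sparse linear predictor plus correlated Gaussian noise, and the noise covariance is taken to be an identity term plus a similarity kernel built from confounders. All names (`simulate_probit_lmm`, `mc_likelihood`, `independent_probit_likelihood`, `Z`, `K`) are hypothetical. The naive Monte Carlo estimator only makes the intractable integral concrete; it is not the scalable inference algorithm of the paper.

```python
import math
import numpy as np

def simulate_probit_lmm(X, w, cov, rng):
    """Draw labels y = sign(X w + eps) with eps ~ N(0, cov) (assumed model form)."""
    n = X.shape[0]
    eps = rng.multivariate_normal(np.zeros(n), cov)
    return np.where(X @ w + eps > 0, 1, -1)

def independent_probit_likelihood(X, y, w, sigma):
    """Closed form available only when cov = sigma^2 * I: a product of normal CDFs."""
    z = y * (X @ w) / sigma
    cdf = [0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z]
    return float(np.prod(cdf))

def mc_likelihood(X, y, w, cov, rng, n_samples=20000):
    """Naive Monte Carlo estimate of P(y | X, w) for a general covariance.

    Counts the fraction of noise draws for which every sign of X w + eps
    matches y. With correlated noise this joint probability does not
    factorize into per-sample terms.
    """
    n = X.shape[0]
    eps = rng.multivariate_normal(np.zeros(n), cov, size=n_samples)
    signs = np.where(X @ w + eps > 0, 1, -1)
    return float(np.mean(np.all(signs == y, axis=1)))

rng = np.random.default_rng(0)
n, d = 5, 20
X = rng.standard_normal((n, d))

# Sparse weight vector: only the first three features carry signal.
w = np.zeros(d)
w[:3] = [2.0, -1.5, 1.0]

# Hypothetical confounders Z induce correlated noise through the kernel K.
Z = rng.standard_normal((n, 2))
K = Z @ Z.T
cov = 0.5 * np.eye(n) + 0.5 * K

y = simulate_probit_lmm(X, w, cov, rng)
lik = mc_likelihood(X, y, w, cov, rng)
print("Monte Carlo likelihood estimate:", lik)
```

The sample size is kept tiny on purpose: the probability of matching all signs at once shrinks quickly as the number of samples grows, so this estimator becomes useless for realistic data, which is the kind of difficulty an approximate inference scheme has to address.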


62J05 Linear regression; mixed models
62J12 Generalized linear models (logistic models)
62F15 Bayesian inference
62R07 Statistical aspects of big data and data science
62P10 Applications of statistics to biology and medical sciences; meta analysis


Eigenstrat; DREBIN; ccSVM

