\(\ell_{1}\)-penalization for mixture regression models. (English) Zbl 1203.62128

Summary: We consider a finite mixture of regressions (FMR) model for high-dimensional inhomogeneous data where the number of covariates may be much larger than the sample size. We propose an \(\ell_{1}\)-penalized maximum likelihood estimator in an appropriate parameterization. This kind of estimation belongs to a class of problems where optimization and theory for non-convex functions are needed. This clearly distinguishes it from high-dimensional estimation with convex loss or objective functions, as, for example, with the Lasso in linear or generalized linear models. Mixture models are a prime example where non-convexity arises. For FMR models, we develop an efficient EM algorithm for numerical optimization with provable convergence properties. Our penalized estimator is numerically better posed (e.g., the criterion function is bounded) than unpenalized maximum likelihood estimation, and it allows for effective statistical regularization including variable selection. We also present some asymptotic theory and oracle inequalities: because the negative log-likelihood function is non-convex, these require different mathematical arguments than problems with convex losses. Finally, we apply the new method to both simulated and real data.
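To make the idea of an EM algorithm for an \(\ell_{1}\)-penalized mixture of Gaussian regressions concrete, the following is a minimal sketch and not the authors' algorithm. It rests on several assumptions that the summary does not state: the penalty is applied directly to the regression coefficients rather than in the paper's parameterization, the penalty is not weighted by the mixing proportions, and the M-step is a generalized one made of a few coordinate-descent sweeps with soft-thresholding. All function names (`fit_fmr_lasso`, `em_step`, and so on) are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def log_component_densities(X, y, pi, beta, sigma):
    """(n, k) array of log(pi_r * N(y_i | x_i' beta_r, sigma_r^2))."""
    resid = y[:, None] - X @ beta.T
    return (np.log(pi) - np.log(sigma) - 0.5 * np.log(2 * np.pi)
            - 0.5 * (resid / sigma) ** 2)

def penalized_nll(X, y, pi, beta, sigma, lam):
    """Negative mean log-likelihood plus lam * sum_r ||beta_r||_1."""
    logp = log_component_densities(X, y, pi, beta, sigma)
    m = logp.max(axis=1, keepdims=True)
    loglik = (m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))).mean()
    return -loglik + lam * np.abs(beta).sum()

def em_step(X, y, pi, beta, sigma, lam, sweeps=5):
    """One generalized EM iteration (assumed simplification, see lead-in)."""
    n, p = X.shape
    # E-step: posterior component probabilities (responsibilities).
    logp = log_component_densities(X, y, pi, beta, sigma)
    logp -= logp.max(axis=1, keepdims=True)
    resp = np.exp(logp)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: closed form for pi, coordinate descent for beta, then sigma.
    pi = resp.mean(axis=0)
    beta, sigma = beta.copy(), sigma.copy()
    for r in range(len(pi)):
        w = resp[:, r]
        for _ in range(sweeps):
            for j in range(p):
                # Residual with coordinate j's contribution added back.
                partial = y - X @ beta[r] + X[:, j] * beta[r, j]
                num = (w * X[:, j] * partial).sum()
                den = (w * X[:, j] ** 2).sum()
                beta[r, j] = soft_threshold(num, n * lam * sigma[r] ** 2) / den
        resid = y - X @ beta[r]
        # Floor on sigma is an ad hoc safeguard, not part of any cited method.
        sigma[r] = max(float(np.sqrt((w * resid ** 2).sum() / w.sum())), 1e-6)
    return pi, beta, sigma

def fit_fmr_lasso(X, y, k=2, lam=0.05, n_iter=20, seed=0):
    """Run the sketch from a random start; return estimates and objective path."""
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    beta = rng.normal(scale=0.1, size=(k, X.shape[1]))
    sigma = np.full(k, float(y.std()))
    history = [penalized_nll(X, y, pi, beta, sigma, lam)]
    for _ in range(n_iter):
        pi, beta, sigma = em_step(X, y, pi, beta, sigma, lam)
        history.append(penalized_nll(X, y, pi, beta, sigma, lam))
    return pi, beta, sigma, history
```

A call such as `fit_fmr_lasso(X, y, k=2, lam=0.05)` returns the mixing proportions, a `(k, p)` coefficient array, the component standard deviations, and the path of the penalized objective. Because the objective is non-convex, the result depends on the starting point, and the sketch includes no handling of empty components.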


62J12 Generalized linear models (logistic models)
62F12 Asymptotic properties of parametric estimators
62J07 Ridge regression; shrinkage estimators (Lasso)
90C90 Applications of mathematical programming
65C60 Computational problems in statistics (MSC2010)


flexmix; glmnet

