
Nearly unbiased variable selection under minimax concave penalty. (English) Zbl 1183.62120

Summary: We propose MC\(+\), a fast, continuous, nearly unbiased and accurate method of penalized variable selection in high-dimensional linear regression. The LASSO is fast and continuous but biased, and its bias may prevent consistent variable selection; subset selection is unbiased but computationally costly. The MC\(+\) has two elements: a minimax concave penalty (MCP) and a penalized linear unbiased selection (PLUS) algorithm. The MCP provides the convexity of the penalized loss in sparse regions to the greatest extent given certain thresholds for variable selection and unbiasedness. The PLUS computes multiple exact local minimizers of a possibly nonconvex penalized loss function in a certain main branch of the graph of critical points of the penalized loss. Its output is a continuous piecewise linear path extending from the origin (infinite penalty) to a least squares solution (zero penalty).
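The MCP itself has a simple closed form. The following sketch is purely illustrative and is not the paper's PLUS algorithm: it treats a single coordinate with a unit-norm design column, and the function names are ours. It evaluates the penalty \(\rho(t;\lambda)=\lambda\int_0^{|t|}(1-x/(\gamma\lambda))_+\,dx\) and the scalar thresholding rule it induces, the "firm" shrinkage of Gao and Bruce [22]:

import numpy as np

def mcp_penalty(t, lam, gamma):
    # rho(t; lam) = lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam, constant afterwards
    at = np.abs(t)
    quad = lam * at - at**2 / (2.0 * gamma)      # increasing, concave part
    flat = 0.5 * gamma * lam**2                  # flat beyond gamma*lam: no further shrinkage, hence near-unbiasedness
    return np.where(at <= gamma * lam, quad, flat)

def mcp_threshold(z, lam, gamma):
    # minimizer of (z - b)^2/2 + mcp_penalty(b, lam, gamma) over b, for gamma > 1
    az = np.abs(z)
    inner = np.sign(z) * np.maximum(az - lam, 0.0) / (1.0 - 1.0 / gamma)   # lam < |z| <= gamma*lam
    return np.where(az > gamma * lam, z, inner)  # identity (unbiased) once |z| > gamma*lam

As \(\gamma\to\infty\) this rule approaches soft thresholding (the LASSO), and as \(\gamma\to 1+\) it approaches hard thresholding, which is the sense in which the MCP interpolates between penalized \(\ell_1\) and subset selection.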
We prove that at a universal penalty level the MC\(+\) has high probability of matching the signs of the unknowns, and thus correct selection, without assuming the strong irrepresentable condition required by the LASSO. This selection consistency applies to the case of \(p\gg n\), and is proved to hold for exactly the MC\(+\) solution among possibly many local minimizers. We prove that the MC\(+\) attains certain minimax convergence rates in probability for the estimation of regression coefficients in \(\ell _r\) balls. We use the SURE method to derive degrees of freedom and C\(_p\)-type risk estimates for general penalized LSE, including the LASSO and MC\(+\) estimators, and prove their unbiasedness. Based on the estimated degrees of freedom, we propose an estimator of the noise level for proper choice of the penalty level. For full rank designs and general sub-quadratic penalties, we provide necessary and sufficient conditions for the continuity of the penalized LSE. Simulation results overwhelmingly support our claim of superior variable selection properties and demonstrate the computational efficiency of the proposed method.
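For context on the C\(_p\)-type risk estimates, the generic Mallows/SURE form (written here in its standard textbook shape following Stein [39] and Efron [12], not the paper's exact statement) is
\[
\widehat{C}_p(\lambda)=\frac{\|y-\hat\mu(\lambda)\|^2}{\sigma^2}-n+2\,\widehat{\mathrm{df}}(\lambda),
\qquad
\widehat{\mathrm{df}}(\lambda)=\frac{1}{\sigma^2}\sum_{i=1}^n\operatorname{cov}\bigl(\hat\mu_i(\lambda),y_i\bigr),
\]
so that \(\sigma^2\,E\,\widehat{C}_p(\lambda)=E\|\hat\mu(\lambda)-\mu\|^2\) under Gaussian noise; substituting the estimated noise level proposed in the paper for \(\sigma^2\) turns this into a data-driven criterion for choosing the penalty level \(\lambda\).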

MSC:

62J05 Linear regression; mixed models
62H12 Estimation in multivariate analysis
65C60 Computational problems in statistics (MSC2010)
62J07 Ridge regression; shrinkage estimators (Lasso)
62H25 Factor analysis and principal components; correspondence analysis

Software:

Bolasso; S+WAVELETS

References:

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proc. 2nd International Symposium on Information Theory (B. N. Petrov and F. Csáki, eds.) 267-281. Akadémiai Kiadó, Budapest. · Zbl 0283.62006
[2] Antoniadis, A. and Fan, J. (2001). Regularized wavelet approximations (with discussion). J. Amer. Statist. Assoc. 96 939-967. · Zbl 1072.62561 · doi:10.1198/016214501753208942
[3] Bach, F. R. (2008). Bolasso: Model consistent Lasso estimation through the bootstrap. In Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008, Helsinki, Finland) (A. McCallum and S. Roweis, eds.) 33-40.
[4] Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the lasso. Electron. J. Stat. 1 169-194 (electronic). · Zbl 1146.62028 · doi:10.1214/07-EJS008
[5] Candès, E. and Tao, T. (2005). Decoding by linear programming. IEEE Trans. Inform. Theory 51 4203-4215. · Zbl 1264.94121 · doi:10.1109/TIT.2005.858979
[6] Candès, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n (with discussion). Ann. Statist. 35 2313-2404. · Zbl 1139.62019 · doi:10.1214/009053606000001523
[7] Chen, S. and Donoho, D.L. (1994). On basis pursuit. Technical report, Dept. Statistics, Stanford Univ.
[8] Davidson, K. and Szarek, S. (2001). Local operator theory, random matrices and Banach spaces. In Handbook on the Geometry of Banach Spaces (W. B. Johnson and J. Lindenstrauss, eds.) I 317-366. North-Holland, Amsterdam. · Zbl 1067.46008 · doi:10.1016/S1874-5849(01)80010-3
[9] Donoho, D. L. and Johnstone, I. (1994a). Minimax risk over \(\ell_p\)-balls for \(\ell_q\)-error. Probab. Theory Related Fields 99 277-303. · Zbl 0802.62006 · doi:10.1007/BF01199026
[10] Donoho, D. L. and Johnstone, I. M. (1994b). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425-455. · Zbl 0815.62019 · doi:10.1093/biomet/81.3.425
[11] Donoho, D. L., Johnstone, I. M., Hoch, J. C. and Stern, A. S. (1992). Maximum entropy and the nearly black object (with discussion). J. Roy. Statist. Soc. Ser. B 54 41-81. · Zbl 0788.62103
[12] Efron, B. (1986). How biased is the apparent error rate of a prediction rule? J. Amer. Statist. Assoc. 81 461-470. · Zbl 0621.62073 · doi:10.2307/2289236
[13] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407-499. · Zbl 1091.62054 · doi:10.1214/009053604000000067
[14] Efron, B., Hastie, T. and Tibshirani, R. (2007). Discussion: The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 2358-2364. · doi:10.1214/009053607000000433
[15] Fan, J. (1997). Comments on “Wavelets in statistics: A review” by A. Antoniadis. J. Italian Statist. Assoc. 6 131-138.
[16] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348-1360. · Zbl 1073.62547 · doi:10.1198/016214501753382273
[17] Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh-dimensional feature space. J. Roy. Statist. Soc. Ser. B 70 849-911.
[18] Fan, J. and Peng, H. (2004). On nonconcave penalized likelihood with diverging number of parameters. Ann. Statist. 32 928-961. · Zbl 1092.62031 · doi:10.1214/009053604000000256
[19] Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947-1975. · Zbl 0829.62066 · doi:10.1214/aos/1176325766
[20] Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148-156. Morgan Kaufmann, San Francisco.
[21] Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). Ann. Statist. 28 337-407. · Zbl 1106.62323 · doi:10.1214/aos/1016218223
[22] Gao, H.-Y. and Bruce, A. G. (1997). Waveshrink with firm shrinkage. Statist. Sinica 7 855-874. · Zbl 1067.62529
[23] Genkin, A., Lewis, D. D. and Madigan, D. (2004). Large-scale Bayesian logistic regression for text categorization. Technical report, DIMACS, Rutgers Univ.
[24] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971-988. · Zbl 1055.62078 · doi:10.3150/bj/1106314846
[25] Huang, J., Ma, S. and Zhang, C.-H. (2008). Adaptive Lasso for sparse high-dimensional regression models. Statist. Sinica 18 1603-1618. · Zbl 1255.62198
[26] Hunter, D. R. and Li, R. (2005). Variable selection using MM algorithms. Ann. Statist. 33 1617-1642. · Zbl 1078.62028 · doi:10.1214/009053605000000200
[27] Mallows, C. L. (1973). Some comments on C\(_p\). Technometrics 15 661-675. · Zbl 0269.62061 · doi:10.2307/1267380
[28] Meinshausen, N. (2007). Relaxed Lasso. Comput. Statist. Data Anal. 52 374-393. · Zbl 1452.62522
[29] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Ann. Statist. 34 1436-1462. · Zbl 1113.62082 · doi:10.1214/009053606000000281
[30] Meinshausen, N., Rocha, G. and Yu, B. (2007). Discussion: The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Statist. 35 2373-2384. · doi:10.1214/009053607000000460
[31] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist. 37 2246-2270. · Zbl 1155.62050 · doi:10.1214/07-AOS582
[32] Meyer, M. and Woodroofe, M. (2000). On the degrees of freedom in shape-restricted regression. Ann. Statist. 28 1083-1104. · Zbl 1105.62340 · doi:10.1214/aos/1015956708
[33] Osborne, M., Presnell, B. and Turlach, B. (2000a). A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20 389-404. · Zbl 0962.65036 · doi:10.1093/imanum/20.3.389
[34] Osborne, M., Presnell, B. and Turlach, B. (2000b). On the lasso and its dual. J. Comput. Graph. Statist. 9 319-337. · doi:10.2307/1390657
[35] Park, M. Y. and Hastie, T. (2007). An L1 regularization-path algorithm for generalized linear models. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 659-677. · doi:10.1111/j.1467-9868.2007.00607.x
[36] Rosset, S. and Zhu, J. (2007). Piecewise linear regularized solution paths. Ann. Statist. 35 1012-1030. · Zbl 1194.62094 · doi:10.1214/009053606000001370
[37] Schapire, R. E. (1990). The strength of weak learnability. Machine Learning 5 197-227. · Zbl 1142.62372
[38] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464. · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[39] Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Ann. Statist. 9 1135-1151. · Zbl 0476.62035 · doi:10.1214/aos/1176345632
[40] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58 267-288. · Zbl 0850.62538
[41] Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theory 52 1030-1051. · Zbl 1288.94025 · doi:10.1109/TIT.2005.864420
[42] Van de Geer, S. (2008). High-dimensional generalized linear models and the Lasso. Ann. Statist. 36 614-645. · Zbl 1138.62323 · doi:10.1214/009053607000000929
[43] Wainwright, M. (2006). Sharp thresholds for high-dimensional and noisy recovery of sparsity. Technical Report 708, Dept. Statistics, Univ. California, Berkeley.
[44] Ye, F. and Zhang, C.-H. (2009). Rate Minimaxity of the Lasso and Dantzig Estimators. Technical Report No. 2009-001. Dept. Statistics, Rutgers Univ.
[45] Yuan, M. and Lin, Y. (2007). On the nonnegative garrote estimator. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 143-161. · Zbl 1120.62052 · doi:10.1111/j.1467-9868.2007.00581.x
[46] Zhang, C.-H. (2007a). Continuous generalized gradient descent. J. Comput. Graph. Statist. 16 761-781. · doi:10.1198/106186007X238846
[47] Zhang, C.-H. (2007b). Penalized linear unbiased selection. Technical Report 2007-003. Dept. Statistics, Rutgers Univ.
[48] Zhang, C.-H. (2007c). Information-theoretic optimality of variable selection with concave penalty. Technical Report No. 2007-008. Dept. Statistics, Rutgers Univ.
[49] Zhang, C.-H. (2008). Discussion: One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist. 36 1553-1560. · Zbl 1282.62110 · doi:10.1214/07-AOS0316C
[50] Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the LASSO selection in high-dimensional regression. Ann. Statist. 36 1567-1594. · Zbl 1142.62044 · doi:10.1214/07-AOS520
[51] Zhao, P. and Yu, B. (2006). On model selection consistency of LASSO. J. Mach. Learn. Res. 7 2541-2567. · Zbl 1222.62008
[52] Zhao, P. and Yu, B. (2007). Stagewise Lasso. J. Mach. Learn. Res. 8 2701-2726. · Zbl 1222.68345
[53] Zou, H. (2006). The adaptive Lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418-1429. · Zbl 1171.62326 · doi:10.1198/016214506000000735
[54] Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Ann. Statist. 36 1509-1533. · Zbl 1142.62027 · doi:10.1214/009053607000000802