The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). (With discussions and rejoinder). (English) Zbl 1139.62019

Summary: In many important statistical applications, the number of variables or parameters \(p\) is much larger than the number of observations \(n\). Suppose then that we have observations \(y=X\beta+z\), where \(\beta\in\mathbb R^p\) is a parameter vector of interest, \(X\) is a data matrix with possibly far fewer rows than columns, \(n\ll p\), and the \(z_i\)’s are i.i.d. \(N(0,\sigma^2)\). Is it possible to estimate \(\beta\) reliably based on the noisy data \(y\)? To estimate \(\beta\), we introduce a new estimator – we call it the Dantzig selector – which is a solution to the \(\ell_1\)-regularization problem
\[ \min_{\widetilde{\beta}\in\mathbb R^p} \|\widetilde{\beta}\|_{\ell_1} \quad\text{subject to}\quad \|X^*r\|_{\ell_\infty}\leq (1+t^{-1}) \sqrt{2\log p}\cdot\sigma, \]
where \(r\) is the residual vector \(y-X\widetilde{\beta}\) and \(t\) is a positive scalar. We show that if \(X\) obeys a uniform uncertainty principle (with unit-normed columns) and if the true parameter vector \(\beta\) is sufficiently sparse (which here roughly guarantees that the model is identifiable), then, with very large probability, the Dantzig selector \(\widehat{\beta}\) obeys
\[ \|\widehat{\beta}-\beta\|_{\ell_2}^2\leq C^2\cdot 2\log p\cdot \Bigl(\sigma^2+ \sum_i \min(\beta_i^2,\sigma^2)\Bigr). \]
Our results are nonasymptotic and we give explicit values for the constant \(C\). Even though \(n\) may be much smaller than \(p\), our estimator achieves a loss within a logarithmic factor of the ideal mean squared error one would achieve with an oracle supplying perfect information about which coordinates are nonzero and which are above the noise level.
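To see where the ideal risk on the right-hand side comes from, a standard orthonormal-design heuristic (a sketch, not the paper's full argument) is helpful: an oracle told which coordinates matter would, for each \(i\), either estimate \(\beta_i\) at variance cost \(\sigma^2\) or set it to zero at bias cost \(\beta_i^2\), whichever is smaller, so that
\[ \text{ideal risk} \approx \sum_i \min(\beta_i^2,\sigma^2), \]
which the displayed bound matches up to the factor \(C^2\cdot 2\log p\); the additive \(\sigma^2\) keeps the bound meaningful even in the degenerate case \(\beta=0\).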
In multivariate regression and from a model selection viewpoint, our result says that it is possible to nearly select the best subset of variables by solving a very simple convex program, which can in fact easily be recast as a convenient linear program; a sketch of this recast follows.
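As an illustration of that linear-programming recast, here is a minimal sketch in Python (the synthetic data, the variable names, and the choice \(t=1/2\) are ours, not from the paper). Introducing \(w\geq|\widetilde{\beta}|\), the program becomes \(\min\sum_i w_i\) subject to \(-w\leq\widetilde{\beta}\leq w\) and \(-\lambda\leq X^*(y-X\widetilde{\beta})\leq\lambda\) with \(\lambda=(1+t^{-1})\sqrt{2\log p}\cdot\sigma\), a standard LP that scipy.optimize.linprog handles directly.

# Minimal sketch: Dantzig selector as a linear program (assumed setup, not
# the authors' code). Variables z = [beta_tilde, w] with w >= |beta_tilde|.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, p, sigma = 72, 256, 1.0

# Sparse ground truth and noisy observations y = X beta + z.
beta = np.zeros(p)
beta[rng.choice(p, size=5, replace=False)] = rng.normal(0, 10, size=5)
X = rng.normal(size=(n, p)) / np.sqrt(n)   # columns roughly unit-normed
y = X @ beta + sigma * rng.normal(size=n)

t = 0.5                                    # positive scalar from the paper
lam = (1 + 1 / t) * np.sqrt(2 * np.log(p)) * sigma

# Inequality blocks:  beta - w <= 0,  -beta - w <= 0,
#                     -lam <= X^T y - X^T X beta <= lam.
I = np.eye(p)
G = X.T @ X
A_ub = np.block([[ I, -I],
                 [-I, -I],
                 [-G, np.zeros((p, p))],
                 [ G, np.zeros((p, p))]])
b_ub = np.concatenate([np.zeros(p), np.zeros(p),
                       lam - X.T @ y, lam + X.T @ y])
c = np.concatenate([np.zeros(p), np.ones(p)])   # minimize sum of w = ||beta||_1
bounds = [(None, None)] * p + [(0, None)] * p   # beta free, w nonnegative

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
beta_hat = res.x[:p]
print("squared l2 error:", np.sum((beta_hat - beta) ** 2))

At this size the dense 4p-by-2p LP solves quickly with the HiGHS backend; for much larger p one would pass sparse constraint matrices or use a dedicated solver.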

MSC:

62G08 Nonparametric regression and quantile regression
62G05 Nonparametric estimation
94A08 Image processing (compression, reconstruction, etc.) in information and communication theory
94A12 Signal theory (characterization, reconstruction, filtering, etc.)
90C05 Linear programming
62C05 General considerations in statistical decision theory

Software:

lars; PDCO

References:

[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automatic Control 19 716-723. · Zbl 0314.62039 · doi:10.1109/TAC.1974.1100705
[2] Antoniadis, A. and Fan, J. (2001). Regularization of wavelet approximations (with discussion). J. Amer. Statist. Assoc. 96 939-967. · Zbl 1072.62561 · doi:10.1198/016214501753208942
[3] Baraud, Y. (2000). Model selection for regression on a fixed design. Probab. Theory Related Fields 117 467-493. · Zbl 0997.62027 · doi:10.1007/s004400000058
[4] Barron, A. R., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301-413. · Zbl 0946.62036 · doi:10.1007/s004400050210
[5] Barron, A. R. and Cover, T. M. (1991). Minimum complexity density estimation. IEEE Trans. Inform. Theory 37 1034-1054. · Zbl 0743.62003 · doi:10.1109/18.86996
[6] Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam (D. Pollard, E. Torgersen and G. L. Yang, eds.) 55-87. Springer, New York. · Zbl 0920.62042
[7] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. 3 203-268. · Zbl 1037.62001 · doi:10.1007/s100970100031
[8] Boyd, S. and Vandenberghe, L. (2004). Convex Optimization. Cambridge Univ. Press. · Zbl 1058.90049
[9] Candès, E. J. and Romberg, J. (2005). Practical signal recovery from random projections. In Computational Imaging III: Proc. SPIE International Symposium on Electronic Imaging 1 76-86. San Jose, CA.
[10] Candès, E. J., Romberg, J. and Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math. 59 1207-1223. · Zbl 1098.94009 · doi:10.1002/cpa.20124
[11] Candès, E. J., Romberg, J. and Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theory 52 489-509. · Zbl 1231.94017 · doi:10.1109/TIT.2005.862083
[12] Candès, E. J., Rudelson, M., Vershynin, R. and Tao, T. (2005). Error correction via linear programming. In Proc. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS) 295-308. IEEE, Los Alamitos, CA.
[13] Candès, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Trans. Inform. Theory 51 4203-4215. · Zbl 1264.94121 · doi:10.1109/TIT.2005.858979
[14] Candès, E. J. and Tao, T. (2006). Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inform. Theory 52 5406-5425. · Zbl 1309.94033 · doi:10.1109/TIT.2006.885507
[15] Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33-61. · Zbl 0919.94002 · doi:10.1137/S1064827596304010
[16] Daniel, B. L., Yen, Y. F., Glover, G. H. et al. (1998). Breast disease: Dynamic spiral MR imaging. Radiology 209 499-509.
[17] Daubechies, I. (2005). Personal communication.
[18] Donoho, D. L. (2006). For most large underdetermined systems of linear equations the minimal \(\ell_1\)-norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59 797-829. · Zbl 1113.15004 · doi:10.1002/cpa.20132
[19] Donoho, D. L. (2006). Compressed sensing. IEEE Trans. Inform. Theory 52 1289-1306. · Zbl 1288.94016 · doi:10.1109/TIT.2006.871582
[20] Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory 47 2845-2862. · Zbl 1019.94503 · doi:10.1109/18.959265
[21] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425-455. · Zbl 0815.62019 · doi:10.1093/biomet/81.3.425
[22] Donoho, D. L. and Johnstone, I. M. (1994). Ideal denoising in an orthonormal basis chosen from a library of bases. C. R. Acad. Sci. Paris Sér. I Math. 319 1317-1322. · Zbl 0819.94007
[23] Donoho, D. L. and Johnstone, I. M. (1995). Empirical atomic decomposition. Unpublished manuscript.
[24] Elad, M. and Bruckstein, A. M. (2002). A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Trans. Inform. Theory 48 2558-2567. · Zbl 1062.15001 · doi:10.1109/TIT.2002.801410
[25] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928-961. · Zbl 1092.62031 · doi:10.1214/009053604000000256
[26] Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947-1975. · Zbl 0829.62066 · doi:10.1214/aos/1176325766
[27] Fuchs, J. (2004). On sparse representations in arbitrary redundant bases. IEEE Trans. Inform. Theory 50 1341-1344. · Zbl 1284.94018 · doi:10.1109/TIT.2004.828141
[28] Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10 971-988. · Zbl 1055.62078 · doi:10.3150/bj/1106314846
[29] Haupt, J. and Nowak, R. (2006). Signal reconstruction from noisy random projections. IEEE Trans. Inform. Theory 52 4036-4048. · Zbl 1323.94046 · doi:10.1109/TIT.2006.880031
[30] Kettenring, J., Lindsay, B. and Siegmund, D., eds. (2003). Statistics: Challenges and opportunities for the twenty-first century. NSF report. Available at www.pnl.gov/scales/docs/nsf_report.pdf.
[31] Mallows, C. L. (1973). Some comments on \(C_P\). Technometrics 15 661-675. · Zbl 0269.62061 · doi:10.2307/1267380
[32] Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM J. Comput. 24 227-234. · Zbl 0827.68054 · doi:10.1137/S0097539792240406
[33] Peters, D. C., Korosec, F. R., Grist, T. M., Block, W. F., Holden, J. E., Vigen, K. K. and Mistretta, C. A. (2000). Undersampled projection reconstruction applied to MR angiography. Magnetic Resonance in Medicine 43 91-101.
[34] Rudin, L. I., Osher, S. and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D 60 259-268. · Zbl 0780.49028 · doi:10.1016/0167-2789(92)90242-F
[35] Sardy, S., Bruce, A. G. and Tseng, P. (2000). Block coordinate relaxation methods for nonparametric wavelet denoising. J. Comput. Graph. Statist. 9 361-379.
[36] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461-464. · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[37] Szarek, S. J. (1991). Condition numbers of random matrices. J. Complexity 7 131-149. · Zbl 0760.15018 · doi:10.1016/0885-064X(91)90002-F
[38] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267-288. · Zbl 0850.62538