Sparsity in penalized empirical risk minimization. (English) Zbl 1168.62044

Summary: Let \((X,Y)\) be a random couple in \(S\times T\) with unknown distribution \(P\). Let \((X_1,Y_1),\dots, (X_n,Y_n)\) be i.i.d. copies of \((X,Y)\), \(P_n\) being their empirical distribution. Let \(h_1,\dots, h_N:S\mapsto[-1,1]\) be a dictionary consisting of \(N\) functions. For \(\lambda\in\mathbb R^N\), denote \(f_\lambda:=\sum_{j=1}^N \lambda_jh_j\). Let \(\ell:T\times\mathbb R\mapsto\mathbb R\) be a given loss function, which is convex with respect to the second variable. Denote \((\ell\bullet f)(x,y):= \ell(y;f(x))\). We study the following penalized empirical risk minimization problem: \[ \widehat{\lambda}^\varepsilon:= \mathop{\operatorname {argmin}}_{\lambda\in\mathbb R^N}\bigl[P_n(\ell\bullet f_\lambda)+ \varepsilon \|\lambda\|_{\ell_p}^p\bigr], \] which is an empirical version of the problem:
\[ \lambda^\varepsilon:= \mathop{\operatorname {argmin}}_{\lambda\in\mathbb R^N}\bigl[P(\ell\bullet f_\lambda)+ \varepsilon \|\lambda\|_{\ell_p}^p\bigr], \]
(here \(\varepsilon\geq0\) is a regularization parameter; \(\lambda^0\) corresponds to \(\varepsilon=0\)). A number of regression and classification problems fit this general framework. We are interested in the case when \(p\geq1\), but it is close enough to 1 (so that \(p-1\) is of the order \(1/\log N\), or smaller). We show that the “sparsity” of \(\lambda^\varepsilon\) implies the “sparsity” of \(\widehat{\lambda}^\varepsilon\) and study the impact of “sparsity” on bounding the excess risk \(P(\ell\bullet f_{\widehat{\lambda}^\varepsilon})-P(\ell\bullet f_{\lambda^0})\) of solutions of empirical risk minimization problems.


62G30 Order statistics; empirical distribution functions
62G99 Nonparametric inference
60E15 Inequalities; stochastic orderings
62J99 Linear inference, regression
62H30 Classification and discrimination; cluster analysis (statistical aspects)
Full Text: DOI EuDML


[1] A. Barron, L. Birgé and P. Massart. Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 (1999) 301-413. · Zbl 0946.62036 · doi:10.1007/s004400050210
[2] P. Bartlett, O. Bousquet and S. Mendelson. Local Rademacher complexities. Ann. Statist. 33 (2005) 1497-1537. · Zbl 1083.62034 · doi:10.1214/009053605000000282
[3] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization. Analysis , Algorithms and Engineering Applications . MPS/SIAM, Series on Optimization , Philadelphia, 2001. · Zbl 0986.90032 · doi:10.1137/1.9780898718829
[4] F. Bunea, A. Tsybakov and M. Wegkamp. Aggregation for Gaussian regression. Ann. Statist. 35 (2007) 1674-1697. · Zbl 1209.62065 · doi:10.1214/009053606000001587
[5] F. Bunea, A. Tsybakov and M. Wegkamp. Sparsity oracle inequalities for the LASSO. Electron. J. Statist. 1 (2007) 169-194. · Zbl 1146.62028 · doi:10.1214/07-EJS008
[6] E. Candes and T. Tao. The Dantzig selector statistical estimation when p is much larger than n . Ann. Statist. 35 (2007) 2313-2351. · Zbl 1139.62019 · doi:10.1214/009053606000001523
[7] E. Candes, M. Rudelson, T. Tao and R. Vershynin. Error correction via linear programming. In Proc. 46th Annual IEEE Symposium on Foundations of Computer Science ( FOCS05 ) 295-308. IEEE, Pittsburgh, PA, 2005.
[8] E. Candes, J. Romberg and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math. 59 (2006) 1207-1223. · Zbl 1098.94009 · doi:10.1002/cpa.20124
[9] O. Catoni. Statistical Learning Theory and Stochastic Optimization . Springer, New York, 2004. · Zbl 1076.93002
[10] D. L. Donoho. For most large underdetermined systems of equations the minimal \ell 1 -norm near-solution approximates the sparsest near-solution. Preprint, 2004.
[11] D. L. Donoho. For most large underdetermined systems of linear equations the minimal \ell 1 -norm solution is also the sparsest solution. Comm. Pure Appl. Math. 59 (2006) 797-829. · Zbl 1113.15004 · doi:10.1002/cpa.20132
[12] D. L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory 52 (2006) 1289-1306. · Zbl 1288.94016 · doi:10.1109/TIT.2006.871582
[13] D. L. Donoho, M. Elad and V. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inform. Theory 52 (2006) 6-18. · Zbl 1288.94017 · doi:10.1109/TIT.2005.860430
[14] van de S. Geer. High-dimensional generalized linear models and the Lasso. Ann. Statist. 36 (2008) 614-645. · Zbl 1138.62323 · doi:10.1214/009053607000000929
[15] V. Koltchinskii. Model selection and aggregation in sparse classification problems. Oberwolfach Reports Meeting on Statistical and Probabilistic Methods of Model Selection, October, 2005.
[16] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk mnimization. Ann. Statist. 34 (2006) 2593-2656. · Zbl 1118.62065 · doi:10.1214/009053606000001019
[17] V. Koltchinskii and D. Panchenko. Complexities of convex combinations and bounding the generalization error in classification. Ann. Statist. 33 (2005) 1455-1496. · Zbl 1080.62045 · doi:10.1214/009053605000000228
[18] M. Ledoux and M. Talagrand. Probability in Banach Spaces . Springer, New York, 1991. · Zbl 0748.60004
[19] P. Massart. Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Tolouse ( IX ) 9 (2000) 245-303. · Zbl 0986.62002 · doi:10.5802/afst.961
[20] P. Massart. Concentration Inequalities and Model Selection . Springer, Berlin, 2007. · Zbl 1170.60006
[21] S. Mendelson, A. Pajor and N. Tomczak-Jaegermann. Reconstruction and subgaussian operators in Asymptotic Geometric Analysis. Geom. Funct. Anal. 17 (2007) 1248-1282. · Zbl 1163.46008 · doi:10.1007/s00039-007-0618-7
[22] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the LASSO. Ann. Statist. 34 (2006) 1436-1462. · Zbl 1113.62082 · doi:10.1214/009053606000000281
[23] A. Nemirovski. Topics in non-parametric statistics. In Ecole d’Et’e de Probabilités de Saint-Flour XXVIII , 1998 85-277. P. Bernard (Ed). Springer, New York, 2000. · Zbl 0998.62033
[24] M. Rudelson and R. Vershynin. Geometric approach to error correcting codes and reconstruction of signals. Int. Math. Res. Not. 64 (2005) 4019-4041. · Zbl 1103.94014 · doi:10.1155/IMRN.2005.4019
[25] R. Tibshirani. Regression shrinkage and selection via Lasso. J. Royal Statist. Soc. Ser. B 58 (1996) 267-288. JSTOR: · Zbl 0850.62538
[26] A. Tsybakov. Optimal rates of aggregation. In Proc. 16th Annual Conference on Learning Theory ( COLT ) and 7th Annual Workshop on Kernel Machines , 303-313. Lecture Notes in Artificial Intelligence 2777 . Springer, New York, 2003. · Zbl 1208.62073
[27] van der A. Vaart and J. Wellner. Weak Convergence and Empirical Processes . Springer, New York, 1996. · Zbl 0862.60002
[28] Y. Yang. Mixing strategies for density estimation. Ann. Statist. 28 (2000) 75-87. · Zbl 1106.62322 · doi:10.1214/aos/1016120365
[29] Y. Yang. Aggregating regression procedures for a better performance. Bernoulli 10 (2004) 25-47. · Zbl 1040.62030 · doi:10.3150/bj/1077544602
[30] P. Zhao and B. Yu. On model selection consistency of LASSO. J. Mach. Learn. Res. 7 (2006) 2541-2563. · Zbl 1222.62008
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.