
PAC-Bayesian bounds for sparse regression estimation with exponential weights. (English) Zbl 1274.62463

Summary: We consider the sparse regression model where the number of parameters \(p\) is larger than the sample size \(n\). The difficulty in such high-dimensional problems is to propose estimators that achieve a good compromise between statistical and computational performance. The Lasso is the solution of a convex minimization problem and is therefore computable even for large values of \(p\); however, stringent conditions on the design are required to establish fast rates of convergence for this estimator. A. S. Dalalyan and A. B. Tsybakov [J. Comput. Syst. Sci. 78, No. 5, 1423–1443 (2012; Zbl 1244.62054); Bernoulli 18, No. 3, 914–944 (2012; Zbl 1243.62008); “Aggregation by exponential weighting, sharp oracle inequalities and sparsity”, Machine Learning 72, No. 1–2, 39–61 (2008), cf. also Zbl 1203.62063] proposed an exponential weights procedure that achieves a good compromise between the statistical and computational aspects. Their estimator can be computed for reasonably large \(p\) and satisfies a sparsity oracle inequality in expectation for the empirical excess risk under only mild assumptions on the design. In this paper, we propose an exponential weights estimator similar to that of [Dalalyan and Tsybakov, loc. cit. (2008)], but with improved statistical performance. Our main result is a sparsity oracle inequality in probability for the true excess risk.
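At a high level, exponential weights procedures of this kind aggregate candidate estimators with weights proportional to \(\exp(-\,\text{empirical risk}/\text{temperature})\), modulated by a prior that favours sparse candidates. The sketch below only illustrates this weighting idea on a finite dictionary of sparse least-squares fits; it is not the procedure analysed in the paper (which uses a continuous sparsity prior and is computed by Langevin Monte Carlo), and the function name exponential_weights_sparse and the parameters temperature, max_support and prior_decay are hypothetical choices made for the example.

import itertools
import numpy as np

def exponential_weights_sparse(X, y, temperature=4.0, max_support=2, prior_decay=0.5):
    """Aggregate sparse least-squares fits with exponential weights.

    Candidates are ordinary least-squares estimators restricted to every
    support of size <= max_support; each receives a weight proportional to
    exp(-n * empirical risk / temperature) times a prior that decays with
    the support size. All parameter names and defaults are illustrative.
    """
    n, p = X.shape
    candidates, risks, log_prior = [], [], []
    for k in range(1, max_support + 1):
        for support in itertools.combinations(range(p), k):
            beta = np.zeros(p)
            # Least-squares fit restricted to the candidate support.
            beta[list(support)] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
            candidates.append(beta)
            risks.append(np.mean((y - X @ beta) ** 2))
            log_prior.append(k * np.log(prior_decay))
    log_w = -n * np.array(risks) / temperature + np.array(log_prior)
    log_w -= log_w.max()                # stabilise before exponentiating
    weights = np.exp(log_w)
    weights /= weights.sum()
    # Aggregated estimator: weighted average of the sparse candidates.
    return np.stack(candidates).T @ weights

# Toy usage: p > n with two active coordinates.
rng = np.random.default_rng(0)
n, p = 30, 50
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[[3, 17]] = [2.0, -1.5]
y = X @ beta_star + 0.1 * rng.standard_normal(n)
beta_hat = exponential_weights_sparse(X, y)
print(np.argsort(-np.abs(beta_hat))[:5])   # largest coefficients should include 3 and 17

Enumerating all small supports is feasible only for toy sizes; this is precisely why aggregation over a continuous sparsity prior, computed by Monte Carlo rather than exhaustive enumeration, is used in practice.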

MSC:

62J07 Ridge regression; shrinkage estimators (Lasso)
62J05 Linear regression; mixed models
62G08 Nonparametric regression and quantile regression
62F15 Bayesian inference
62B10 Statistical aspects of information-theoretic topics
68T05 Learning and adaptive systems in artificial intelligence

Software:

Bolasso; PDCO

References:

[1] Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle. In B. N. Petrov and F. Csáki (eds.), 2nd International Symposium on Information Theory, 267-281. Budapest: Akadémiai Kiadó. · Zbl 0283.62006
[2] Alquier, P. (2006)., Transductive and Inductive Adaptative Inference for Regression and Density Estimation. PhD Thesis, Université Paris 6.
[3] Alquier, P. (2008). PAC-Bayesian bounds for randomized empirical risk minimizers., Mathematical Methods of Statistics 17 (4) 279-304. · Zbl 1260.62038 · doi:10.3103/S1066530708040017
[4] Audibert, J.-Y. (2004). Aggregated Estimators and Empirical Complexity for Least Square Regression., Annales de l’Institut Henri Poincaré B: Probability and Statistics 40 (6) 685-736. · Zbl 1052.62037 · doi:10.1016/j.anihpb.2003.11.006
[5] Audibert, J.-Y. (2004)., PAC-Bayesian Statistical Learning Theory. PhD Thesis, Université Paris 6.
[6] Bach, F. (2008). Bolasso: model consistent Lasso estimation through the bootstrap., Proceedings of the 25th International Conference on Machine Learning , ACM, New York.
[7] Bickel, P. J., Ritov, Y. and Tsybakov, A. (2009). Simultaneous Analysis of Lasso and Dantzig Selector., The Annals of Statistics 37 (4) 1705-1732. · Zbl 1173.62022 · doi:10.1214/08-AOS620
[8] Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the Lasso., Electronic Journal of Statistics 1 169-194. · Zbl 1146.62028 · doi:10.1214/07-EJS008
[9] Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Aggregation for Gaussian regression., The Annals of Statistics 35 1674-1697. · Zbl 1209.62065 · doi:10.1214/009053606000001587
[10] Cai, T., Xu, G. and Zhang, J. (2009). On Recovery of Sparse Signals via \(\ell_1\) Minimization., IEEE Transactions on Information Theory 55 3388-3397. · Zbl 1367.94081 · doi:10.1109/TIT.2009.2021377
[11] Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\)., The Annals of Statistics 35 (6) 2313-2351. · Zbl 1139.62019 · doi:10.1214/009053606000001523
[12] Catoni, O. (2003). A PAC-Bayesian approach to adaptive classification. Preprint Laboratoire de Probabilités et Modèles Aléatoires PMA-840. · Zbl 1012.60023 · doi:10.1016/S0246-0203(02)00017-1
[13] Catoni, O. (2004)., Statistical Learning Theory and Stochastic Optimization. Saint-Flour Summer School on Probability Theory 2001 (Jean Picard ed.), Lecture Notes in Mathematics, Springer. · Zbl 1076.93002 · doi:10.1007/b99352
[14] Catoni, O. (2007)., PAC-Bayesian Supervised Classification (The Thermodynamics of Statistical Learning). IMS Lecture Notes - Monograph Series 56 . · Zbl 1277.62015
[15] Chen, S. S., Donoho, D. L. and Saunders, M. A. (2001). Atomic Decomposition by Basis Pursuit., SIAM Review 43 (1) 129-159. · Zbl 0979.94010 · doi:10.1137/S003614450037906X
[16] Dalalyan, A. and Tsybakov, A. (2007). Aggregation by exponential weighting and sharp oracle inequalities., Proceedings of the 20th annual conference on Learning theory , Springer-Verlag Berlin, Heidelberg. · Zbl 1203.62063 · doi:10.1007/978-3-540-72927-3_9
[17] Dalalyan, A. and Tsybakov, A. (2008). Aggregation by exponential weighting, sharp oracle inequalities and sparsity., Machine Learning 72 (1-2) 39-61.
[18] Dalalyan, A. and Tsybakov, A. (2010)., Mirror averaging with sparsity priors. In minor revision for Bernoulli, · Zbl 1243.62008
[19] Dalalyan, A. and Tsybakov, A. (2010)., Sparse Regression Learning by Aggregation and Langevin Monte-Carlo. · Zbl 1244.62054
[20] Dembo, A. and Zeitouni, O. (1998)., Large Deviations Techniques and Applications. Springer. · Zbl 0896.60013
[21] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least Angle Regression., The Annals of Statistics 32 (2) 407-499. · Zbl 1091.62054 · doi:10.1214/009053604000000067
[22] Frank, I. E. and Friedman, J. H. (1993). A Statistical View of Some Chemometrics Regression Tools., Technometrics 35 (2) 109-135. · Zbl 0775.62288
[23] Ghosh, S. (2007)., Adaptive Elastic Net: an Improvement of Elastic Net to Achieve Oracle Properties. · Zbl 1136.90419 · doi:10.1007/978-3-540-72792-7_24
[24] Juditsky, A., Rigollet, P. and Tsybakov, A. (2008). Learning by mirror averaging., The Annals of Statistics 36 (5) 2183-2206. · Zbl 1274.62288 · doi:10.1214/07-AOS546
[25] Koltchinskii, V. (2010). Sparsity in Empirical Risk Minimization., Annales de l’Institut Henri Poincaré B: Probability and Statistics,
[26] Koltchinskii, V. (2009). The Dantzig selector and sparsity oracle inequalities., Bernoulli 15 (3) 799-828. · Zbl 1452.62486 · doi:10.3150/09-BEJ187
[27] Lehmann, E. L. and Casella, G. (1998)., Theory of Point Estimation. Springer. · Zbl 0916.62017
[28] Leung, G. and Barron, A. R. (2006). Information theory and mixing least-squares regressions., IEEE Transactions on Information Theory 52 (8) 3396-3410. · Zbl 1309.94051 · doi:10.1109/TIT.2006.878172
[29] Lounici, K. (2007). Generalized mirror averaging and \(d\)-convex aggregation., Mathematical Methods of Statistics 16 (3). · Zbl 1231.62046 · doi:10.3103/S1066530707030040
[30] Lounici, K. (2008). Sup-norm convergence rate and sign concentration property of Lasso and Dantzig estimators., Electronic Journal of Statistics 2 90-102. · Zbl 1306.62155 · doi:10.1214/08-EJS177
[31] Mallows, C. L. (1973). Some comments on \(C_p\)., Technometrics 15 661-676. · Zbl 0269.62061 · doi:10.2307/1267380
[32] Marin, J.-M. and Robert, C. P. (2007)., Bayesian Core: A practical approach to computational Bayesian analysis. Springer. · Zbl 1137.62013
[33] Massart, P. (2007)., Concentration Inequalities and Model Selection. Saint-Flour Summer School on Probability Theory 2003 (Jean Picard ed.), Lecture Notes in Mathematics, Springer. · Zbl 1170.60006 · doi:10.1007/978-3-540-48503-2
[34] McAllester, D. A. (1998). Some PAC-Bayesian theorems., Proceedings of the Eleventh Annual Conference on Computational Learning Theory. ACM, 230-234. · Zbl 0945.68157 · doi:10.1023/A:1007618624809
[35] Meinshausen, N. and Bühlmann, P. (2010). Stability selection., Journal of the Royal Statistical Society B , · Zbl 1113.62082 · doi:10.1111/j.1467-9868.2010.00740.x
[36] Rigollet, P. and Tsybakov, A. (2010). Exponential Screening and optimal rates of sparse estimation., The Annals of Statistics, · Zbl 1215.62043 · doi:10.1214/10-AOS854
[37] Schwarz, G. (1978). Estimating the dimension of a model., The Annals of Statistics, 6 461-464. · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[38] Shawe-Taylor, J. and Williamson, R. (1997). A PAC analysis of a Bayesian estimator., Proceedings of the Tenth Annual Conference on Computational Learning Theory. ACM, 2-9.
[39] Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO., Journal of the Royal Statistical Society B 58 (1) 267-288. · Zbl 0850.62538
[40] Tsybakov, A. (2003). Optimal rates of aggregation., Computational Learning Theory and Kernel Machines , Lecture Notes in Artificial Intelligence no. 2777, Springer, Heidelberg. 303-313. · Zbl 1208.62073
[41] Tsybakov, A. (2009)., Introduction to Nonparametric Estimation. Springer, New York. · Zbl 1176.62032
[42] Van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the LASSO., Electronic Journal of Statistics 3 1360-1392. · Zbl 1327.62425 · doi:10.1214/09-EJS506
[43] Yang, Y. (2004). Aggregating regression procedures to improve performance., Bernoulli , 10 (1) 25-47. · Zbl 1040.62030 · doi:10.3150/bj/1077544602
[44] Zhang, T. (2006). From \(\epsilon\)-entropy to KL-entropy: analysis of minimum information complexity density estimation., The Annals of Statistics 34 2180-2210. · Zbl 1106.62005 · doi:10.1214/009053606000000704
[45] Zou, H. (2006). The adaptive lasso and its oracle properties., Journal of the American Statistical Association , 101 1418-1429. · Zbl 1171.62326 · doi:10.1198/016214506000000735
[46] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net., Journal of the Royal Statistical Society B 67 (2) 301-320. · Zbl 1069.62054 · doi:10.1111/j.1467-9868.2005.00503.x
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.