Structured sparsity through convex optimization. (English) Zbl 1331.90050

Summary: Sparse estimation methods are aimed at using or obtaining parsimonious representations of data or models. While naturally cast as a combinatorial optimization problem, variable or feature selection admits a convex relaxation through the regularization by the \(\ell_{1}\)-norm. In this paper, we consider situations where we are not only interested in sparsity, but where some structural prior knowledge is available as well. We show that the \(\ell_{1}\)-norm can then be extended to structured norms built on either disjoint or overlapping groups of variables, leading to a flexible framework that can deal with various structures. We present applications to unsupervised learning, for structured sparse principal component analysis and hierarchical dictionary learning, and to supervised learning in the context of nonlinear variable selection.
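The group-norm construction described above can be illustrated concretely. Below is a minimal numpy sketch (illustrative only, not the authors' code; all function names are mine) of the ℓ1 norm, a group norm built as a sum of Euclidean norms over groups of variables, and the corresponding proximal operators that drive the convex optimization algorithms surveyed in the paper:

```python
import numpy as np

def l1_norm(w):
    # classical lasso penalty: sum of absolute values
    return np.sum(np.abs(w))

def group_norm(w, groups):
    # structured penalty: sum of Euclidean norms over groups of indices
    # (groups may be disjoint or overlapping)
    return sum(np.linalg.norm(w[list(g)]) for g in groups)

def prox_l1(w, lam):
    # proximal operator of lam * ||.||_1: elementwise soft-thresholding
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def prox_group(w, groups, lam):
    # proximal operator of lam * sum_g ||w_g||_2: block soft-thresholding.
    # This closed form is valid only for DISJOINT groups; overlapping
    # groups require the more involved methods discussed in the paper.
    out = w.copy()
    for g in groups:
        idx = list(g)
        norm = np.linalg.norm(w[idx])
        scale = max(1.0 - lam / norm, 0.0) if norm > 0 else 0.0
        out[idx] = scale * w[idx]
    return out
```

Block soft-thresholding zeroes out an entire group at once when its norm falls below the threshold, which is exactly how these penalties select structured sets of variables rather than isolated coordinates.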


90C25 Convex programming
62J07 Ridge regression; shrinkage estimators (Lasso)
62J02 General nonlinear regression
62H25 Factor analysis and principal components; correspondence analysis
Full Text: DOI arXiv Euclid


[1] Adams, R., Ghahramani, Z. and Jordan, M. (2010). Tree-structured stick breaking for hierarchical data. In Advances in Neural Information Processing Systems 23 (J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel and A. Culotta, eds.) 19-27.
[2] Aharon, M., Elad, M. and Bruckstein, A. (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Processing 54 4311-4322. · Zbl 1375.94040
[3] Bach, F. R. (2008). Consistency of the group lasso and multiple kernel learning. J. Mach. Learn. Res. 9 1179-1225. · Zbl 1225.68147
[4] Bach, F. (2009). Exploring large feature spaces with hierarchical multiple kernel learning. In Neural Information Processing Systems 21 .
[5] Bach, F. (2010). Structured sparsity-inducing norms through submodular functions. In Advances in Neural Information Processing Systems 23 .
[6] Bach, F. (2011a). Learning with submodular functions: A convex optimization perspective. Technical Report No. 00645271, HAL. · Zbl 1280.68001 · doi:10.1561/2200000039
[7] Bach, F. (2011b). Shaping level sets with submodular functions. In Advances in Neural Information Processing Systems 24 .
[8] Bach, F., Mairal, J. and Ponce, J. (2008). Convex sparse matrix factorizations. Preprint. Available at arXiv:0812.1869
[9] Bach, F., Jenatton, R., Mairal, J. and Obozinski, G. (2012). Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning 4 1-106. · Zbl 06064248
[10] Baraniuk, R. G., Cevher, V., Duarte, M. F. and Hegde, C. (2010). Model-based compressive sensing. IEEE Trans. Inform. Theory 56 1982-2001. · Zbl 1366.94215 · doi:10.1109/TIT.2010.2040894
[11] Beck, A. and Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2 183-202. · Zbl 1175.94009 · doi:10.1137/080716542
[12] Becker, S., Bobin, J. and Candès, E. J. (2011). NESTA: A fast and accurate first-order method for sparse recovery. SIAM J. Imaging Sci. 4 1-39. · Zbl 1209.90265 · doi:10.1137/090756855
[13] Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37 1705-1732. · Zbl 1173.62022 · doi:10.1214/08-AOS620
[14] Blei, D. M., Griffiths, T. L. and Jordan, M. I. (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. J. ACM 57 1-30. · Zbl 1327.68187 · doi:10.1145/1667053.1667056
[15] Blei, D., Ng, A. and Jordan, M. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res. 3 993-1022. · Zbl 1112.68379 · doi:10.1162/jmlr.2003.3.4-5.993
[16] Bondell, H. D. and Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics 64 115-123. · Zbl 1146.62051 · doi:10.1111/j.1541-0420.2007.00843.x
[17] Borwein, J. M. and Lewis, A. S. (2006). Convex Analysis and Nonlinear Optimization. Theory and Examples , 2nd ed. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC 3 . Springer, New York. · Zbl 1116.90001 · doi:10.1007/978-0-387-31256-9
[18] Buntine, W. L. (2002). Variational extensions to EM and multinomial PCA. In Proceedings of the European Conference on Machine Learning ( ECML ). · Zbl 1014.68524 · doi:10.1007/3-540-36755-1_3
[19] Candès, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Trans. Inform. Theory 51 4203-4215. · Zbl 1264.94121 · doi:10.1109/TIT.2005.858979
[20] Cevher, V., Duarte, M. F., Hegde, C. and Baraniuk, R. G. (2008). Sparse signal recovery using Markov random fields. In Advances in Neural Information Processing Systems 20 .
[21] Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33-61. · Zbl 0919.94002 · doi:10.1137/S1064827596304010
[22] Chen, X., Lin, Q., Kim, S., Carbonell, J. G. and Xing, E. P. (2011). Smoothing proximal gradient method for general structured sparse learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence ( UAI ). · Zbl 1243.62100 · doi:10.1214/11-AOAS514
[23] Combettes, P. L. and Pesquet, J. C. (2010). Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering . Springer, New York. · Zbl 1242.90160 · doi:10.1007/978-1-4419-9569-8_10
[24] d’Aspremont, A., Bach, F. and El Ghaoui, L. (2008). Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res. 9 1269-1294. · Zbl 1225.68170
[25] Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 1200-1224. · Zbl 0869.62024 · doi:10.2307/2291512
[26] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407-499. · Zbl 1091.62054 · doi:10.1214/009053604000000067
[27] Friedman, J., Hastie, T. and Tibshirani, R. (2010). A note on the group Lasso and a sparse group Lasso.
[28] Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Stat. 1 302-332. · Zbl 1378.90064 · doi:10.1214/07-AOAS131
[29] Gramfort, A. and Kowalski, M. (2009). Improving M/EEG source localization with an inter-condition sparse prior. In IEEE International Symposium on Biomedical Imaging .
[30] Griffiths, T. L. and Steyvers, M. (2004). Finding scientific topics. Proc. Natl. Acad. Sci. USA 101 5228-5235.
[31] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning. Data Mining , Inference , and Prediction . Springer, New York. · Zbl 0973.62007
[32] Huang, J. and Zhang, T. (2010). The benefit of group sparsity. Ann. Statist. 38 1978-2004. · Zbl 1202.62052 · doi:10.1214/09-AOS778
[33] Huang, J., Zhang, T. and Metaxas, D. (2011). Learning with structured sparsity. J. Mach. Learn. Res. 12 3371-3412. · Zbl 1280.68169
[34] Jacob, L., Obozinski, G. and Vert, J. P. (2009). Group Lasso with overlaps and graph Lasso. In Proceedings of the International Conference on Machine Learning ( ICML ).
[35] Jenatton, R., Audibert, J.-Y. and Bach, F. (2011). Structured variable selection with sparsity-inducing norms. J. Mach. Learn. Res. 12 2777-2824. · Zbl 1280.68170
[36] Jenatton, R., Obozinski, G. and Bach, F. (2010). Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics ( AISTATS ).
[37] Jenatton, R., Gramfort, A., Michel, V., Obozinski, G., Eger, E., Bach, F. and Thirion, B. (2011a). Multi-scale mining of fMRI data with hierarchical structured sparsity. SIAM J. Imaging Sci. To appear. Preprint. Available at arXiv:1105.0363 · Zbl 1263.90059 · doi:10.1137/110832380
[38] Jenatton, R., Mairal, J., Obozinski, G. and Bach, F. (2011b). Proximal methods for hierarchical sparse coding. J. Mach. Learn. Res. 12 2297-2334. · Zbl 1280.94029
[39] Jolliffe, I. T., Trendafilov, N. T. and Uddin, M. (2003). A modified principal component technique based on the LASSO. J. Comput. Graph. Statist. 12 531-547. · doi:10.1198/1061860032148
[40] Kavukcuoglu, K., Ranzato, M. A., Fergus, R. and LeCun, Y. (2009). Learning invariant features through topographic filter maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition ( CVPR ).
[41] Kim, S., Sohn, K. A. and Xing, E. P. (2009). A multivariate regression approach to association analysis of a quantitative trait network. Bioinformatics 25 204-212.
[42] Kim, S. and Xing, E. P. (2010). Tree-guided group Lasso for multi-task regression with structured sparsity. In Proceedings of the International Conference on Machine Learning ( ICML ).
[43] Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Appl. 33 82-95. · Zbl 0201.39702 · doi:10.1016/0022-247X(71)90184-3
[44] Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401 788-791.
[45] Lin, Y. and Zhang, H. H. (2006). Component selection and smoothing in multivariate nonparametric regression. Ann. Statist. 34 2272-2297. · Zbl 1106.62041 · doi:10.1214/009053606000000722
[46] Liu, H., Palatucci, M. and Zhang, J. (2009). Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In Proceedings of the International Conference on Machine Learning ( ICML ).
[47] Lounici, K., Pontil, M., Tsybakov, A. B. and van de Geer, S. (2009). Taking advantage of sparsity in multi-task learning. In Proceedings of the Conference on Learning Theory .
[48] Lounici, K., Pontil, M., van de Geer, S. and Tsybakov, A. B. (2011). Oracle inequalities and optimal inference under group sparsity. Ann. Statist. 39 2164-2204. · Zbl 1306.62156 · doi:10.1214/11-AOS896
[49] Mackey, L. (2009). Deflation methods for sparse PCA. In Advances in Neural Information Processing Systems 21 .
[50] Mairal, J. (2010). Sparse coding for machine learning, image processing and computer vision. Ph.D. thesis, École normale supérieure de Cachan-ENS Cachan.
[51] Mairal, J., Bach, F., Ponce, J. and Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res. 11 19-60. · Zbl 1242.62087
[52] Mairal, J., Jenatton, R., Obozinski, G. and Bach, F. (2011). Convex and network flow optimization for structured sparsity. J. Mach. Learn. Res. 12 2681-2720. · Zbl 1280.68179
[53] Mallat, S. G. (1999). A Wavelet Tour of Signal Processing . Academic Press, New York. · Zbl 0998.94510
[54] Martinez, A. M. and Kak, A. C. (2001). PCA versus LDA. IEEE Trans. Pattern Anal. Mach. Intell. 23 228-233.
[55] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436-1462. · Zbl 1113.62082 · doi:10.1214/009053606000000281
[56] Moghaddam, B., Weiss, Y. and Avidan, S. (2006). Spectral bounds for sparse PCA: Exact and greedy algorithms. In Advances in Neural Information Processing Systems 18 .
[57] Moreau, J.-J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. C. R. Acad. Sci. Paris 255 2897-2899. · Zbl 0118.10502
[58] Needell, D. and Tropp, J. A. (2009). CoSaMP: Iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmon. Anal. 26 301-321. · Zbl 1163.94003 · doi:10.1016/j.acha.2008.07.002
[59] Negahban, S. N. and Wainwright, M. J. (2011). Simultaneous support recovery in high dimensions: Benefits and perils of block \(\ell_{1}/\ell_{\infty}\)-regularization. IEEE Trans. Inform. Theory 57 3841-3863. · Zbl 1365.62274 · doi:10.1109/TIT.2011.2144150
[60] Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2009). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems 22 . · Zbl 1331.62350
[61] Nesterov, Y. (2004). Introductory Lectures on Convex Optimization. A Basic Course. Applied Optimization 87 . Kluwer Academic, Boston, MA. · Zbl 1086.90045
[62] Nesterov, Y. (2007). Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic Univ. Louvain.
[63] Obozinski, G. and Bach, F. (2012). Convex relaxation for combinatorial penalties. Technical report, HAL.
[64] Obozinski, G., Jacob, L. and Vert, J. P. (2011). Group Lasso with overlaps: The latent group Lasso approach. Technical Report No. inria-00628498, HAL.
[65] Obozinski, G., Taskar, B. and Jordan, M. I. (2010). Joint covariate selection and joint subspace selection for multiple classification problems. Stat. Comput. 20 231-252. · doi:10.1007/s11222-008-9111-x
[66] Obozinski, G., Wainwright, M. J. and Jordan, M. I. (2011). Support union recovery in high-dimensional multivariate regression. Ann. Statist. 39 1-47. · Zbl 1373.62372 · doi:10.1214/09-AOS776
[67] Olshausen, B. A. and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 607-609.
[68] Percival, D. (2012). Theoretical properties of the overlapping group Lasso. Electron. J. Statist. 6 269-288. · Zbl 1334.62131 · doi:10.1214/12-EJS672
[69] Quattoni, A., Carreras, X., Collins, M. and Darrell, T. (2009). An efficient projection for \(\ell_{1}/\ell_{\infty}\) regularization. In Proceedings of the International Conference on Machine Learning ( ICML ).
[70] Rao, N. S., Nowak, R. D., Wright, S. J. and Kingsbury, N. G. (2011). Convex approaches to model wavelet sparsity patterns. In International Conference on Image Processing ( ICIP ).
[71] Rapaport, F., Barillot, E. and Vert, J.-P. (2008). Classification of arrayCGH data using fused SVM. Bioinformatics 24 i375-i382.
[72] Ravikumar, P., Lafferty, J., Liu, H. and Wasserman, L. (2009). Sparse additive models. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 1009-1030. · doi:10.1111/j.1467-9868.2009.00718.x
[73] Roth, V. and Fischer, B. (2008). The group-Lasso for generalized linear models: Uniqueness of solutions and efficient algorithms. In Proceedings of the International Conference on Machine Learning ( ICML ).
[74] Rudin, L. I., Osher, S. and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Phys. D 60 259-268. · Zbl 0780.49028 · doi:10.1016/0167-2789(92)90242-F
[75] Schmidt, M., Le Roux, N. and Bach, F. (2011). Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems 24 .
[76] Schmidt, M. and Murphy, K. (2010). Convex structure learning in log-linear models: Beyond pairwise potentials. In Proceedings of the International Conference on Artificial Intelligence and Statistics ( AISTATS ).
[77] Shalev-Shwartz, S., Srebro, N. and Zhang, T. (2010). Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM J. Optim. 20 2807-2832. · Zbl 1208.68226 · doi:10.1137/090759574
[78] Shawe-Taylor, J. and Cristianini, N. (2004). Kernel Methods for Pattern Analysis . Cambridge Univ. Press, Cambridge. · Zbl 0994.68074
[79] Shen, X. and Huang, H.-C. (2010). Grouping pursuit through a regularization solution surface. J. Amer. Statist. Assoc. 105 727-739. · Zbl 1392.62192 · doi:10.1198/jasa.2010.tm09380
[80] Singh, A. P. and Gordon, G. J. (2008). A unified view of matrix factorization models. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases .
[81] Sprechmann, P., Ramirez, I., Sapiro, G. and Eldar, Y. (2010). Collaborative hierarchical sparse modeling. In 44 th Annual Conference on Information Sciences and Systems ( CISS ) 1-6. IEEE.
[82] Stojnic, M., Parvaresh, F. and Hassibi, B. (2009). On the reconstruction of block-sparse signals with an optimal number of measurements. IEEE Trans. Signal Process. 57 3075-3085. · Zbl 1391.94402 · doi:10.1109/TSP.2009.2020754
[83] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267-288. · Zbl 0850.62538
[84] Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 91-108. · Zbl 1060.62049 · doi:10.1111/j.1467-9868.2005.00490.x
[85] Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inform. Theory 50 2231-2242. · Zbl 1288.94019 · doi:10.1109/TIT.2004.834793
[86] Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theory 52 1030-1051. · Zbl 1288.94025
[87] Turlach, B. A., Venables, W. N. and Wright, S. J. (2005). Simultaneous variable selection. Technometrics 47 349-363. · doi:10.1198/004017005000000139
[88] van de Geer, S. (2010). \(\ell_{1}\)-regularization in high-dimensional statistical models. In Proceedings of the International Congress of Mathematicians. Volume IV 2351-2369. Hindustan Book Agency, New Delhi. · Zbl 1513.62068
[89] Varoquaux, G., Jenatton, R., Gramfort, A., Obozinski, G., Thirion, B. and Bach, F. (2010). Sparse structured dictionary learning for brain resting-state activity modeling. In NIPS Workshop on Practical Applications of Sparse Modeling : Open Issues and New Directions .
[90] Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using \(\ell_{1}\)-constrained quadratic programming (Lasso). IEEE Trans. Inform. Theory 55 2183-2202. · Zbl 1367.62220 · doi:10.1109/TIT.2009.2016018
[91] Witten, D. M., Tibshirani, R. and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10 515-534.
[92] Wright, S. J., Nowak, R. D. and Figueiredo, M. A. T. (2009). Sparse reconstruction by separable approximation. IEEE Trans. Signal Process. 57 2479-2493. · Zbl 1391.94442 · doi:10.1109/TSP.2009.2016892
[93] Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Stat. 2 224-244. · Zbl 1137.62045 · doi:10.1214/07-AOAS174SUPP
[94] Xiang, Z. J., Xi, Y. T., Hasson, U. and Ramadge, P. J. (2009). Boosting with spatial regularization. In Advances in Neural Information Processing Systems 22 .
[95] Yuan, M. (2010). High dimensional inverse covariance matrix estimation via linear programming. J. Mach. Learn. Res. 11 2261-2286. · Zbl 1242.62043
[96] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49-67. · Zbl 1141.62030 · doi:10.1111/j.1467-9868.2005.00532.x
[97] Yuan, G.-X., Chang, K.-W., Hsieh, C.-J. and Lin, C.-J. (2010). A comparison of optimization methods and software for large-scale L1-regularized linear classification. J. Mach. Learn. Res. 11 3183-3234. · Zbl 1242.62065
[98] Zass, R. and Shashua, A. (2007). Nonnegative sparse PCA. In Advances in Neural Information Processing Systems 19 .
[99] Zhang, T. (2009). Some sharp performance bounds for least squares regression with \(L_{1}\) regularization. Ann. Statist. 37 2109-2144. · Zbl 1173.62029 · doi:10.1214/08-AOS659
[100] Zhao, P., Rocha, G. and Yu, B. (2009). The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Statist. 37 3468-3497. · Zbl 1369.62164 · doi:10.1214/07-AOS584
[101] Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541-2563. · Zbl 1222.62008
[102] Zhong, L. W. and Kwok, J. T. (2011). Efficient sparse modeling with automatic feature grouping. In Proceedings of the International Conference on Machine Learning ( ICML ).
[103] Zhou, Y., Jin, R. and Hoi, S. C. H. (2010). Exclusive Lasso for multi-task feature selection. In Proceedings of the International Conference on Artificial Intelligence and Statistics ( AISTATS ).
[104] Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418-1429. · Zbl 1171.62326 · doi:10.1198/016214506000000735
[105] Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265-286. · doi:10.1198/106186006X113430
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or perfect matching.