The composite absolute penalties family for grouped and hierarchical variable selection. (English) Zbl 1369.62164

Summary: Extracting useful information from high-dimensional data is an important focus of today’s statistical research and practice. Penalized loss function minimization has been shown to be effective for this task both theoretically and empirically. Combining the virtues of regularization and sparsity, the Lasso, an \(L_{1}\)-penalized squared-error minimization method, has become popular in regression models and beyond.
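For concreteness, the Lasso referred to above is the solution of the \(L_{1}\)-penalized least-squares problem
\[
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^{p}} \; \|y - X\beta\|_{2}^{2} + \lambda \|\beta\|_{1},
\]
where the regularization parameter \(\lambda \geq 0\) trades off fit against sparsity: larger values of \(\lambda\) set more coefficients exactly to zero.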
In this paper, we combine different norms, including \(L_{1}\), to form a penalty that incorporates side information into the fitting of a regression or classification model. Specifically, we introduce the Composite Absolute Penalties (CAP) family, which allows given grouping and hierarchical relationships between the predictors to be expressed. CAP penalties are built by defining groups of predictors and combining norm penalties at the across-group and within-group levels. Grouped selection occurs for nonoverlapping groups; hierarchical variable selection is achieved by defining groups with particular overlapping patterns. We propose using the BLASSO algorithm and cross-validation to compute CAP estimates in general. For a subfamily of CAP estimates involving only the \(L_{1}\) and \(L_{\infty }\) norms, we introduce the iCAP algorithm to trace the entire regularization path for the grouped selection problem. Within this subfamily, unbiased estimates of the degrees of freedom (df) are derived, so that the regularization parameter can be selected without cross-validation. CAP is shown to improve on the predictive performance of the Lasso in a series of simulated experiments, including cases with \(p\gg n\) and possibly mis-specified groupings. When the complexity of a model is properly calculated, iCAP is seen to be parsimonious in the experiments.
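The construction described above can be sketched numerically. The following is a minimal illustration of the CAP penalty's definition (not the authors' implementation): an inner norm is applied within each group of coefficients, and an outer norm combines the resulting group norms. The function name `cap_penalty` and the example groupings are hypothetical choices for illustration.

```python
import numpy as np

def cap_penalty(beta, groups, outer=1.0, inner=np.inf):
    """Composite Absolute Penalty: apply the inner norm within each
    group of coefficients, then the outer norm across group norms."""
    group_norms = [np.linalg.norm(beta[list(g)], ord=inner) for g in groups]
    return np.linalg.norm(np.asarray(group_norms), ord=outer)

beta = np.array([1.0, -2.0, 0.5, 0.0])

# Nonoverlapping groups give grouped selection; L1 across groups with
# L_inf within groups is the iCAP subfamily mentioned in the summary.
print(cap_penalty(beta, [(0, 1), (2, 3)]))  # max(1,2) + max(0.5,0) = 2.5

# Overlapping groups encode hierarchy: with groups {0,1,2,3} and {2,3},
# coefficients 2 and 3 are penalized in both groups, so they tend to
# enter the model only after the larger group is active.
print(cap_penalty(beta, [(0, 1, 2, 3), (2, 3)]))
```

With singleton groups and \(L_{1}\) norms at both levels, the penalty reduces to the ordinary Lasso penalty \(\|\beta\|_{1}\).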


62J07 Ridge regression; shrinkage estimators (Lasso)

