Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. (English) Zbl 1146.62051

Summary: Variable selection can be challenging, particularly in situations with a large number of predictors and possibly high correlations among them, such as gene expression data. In this article, a new method, called OSCAR (octagonal shrinkage and clustering algorithm for regression), is proposed to simultaneously select variables and group them into predictive clusters. In addition to improving prediction accuracy and interpretation, the resulting groups can be investigated further to discover what contributes to their similar behavior. The technique is based on penalized least squares with a geometrically intuitive penalty function that shrinks some coefficients to exactly zero. In addition, the penalty yields exact equality of some coefficients, encouraging correlated predictors with a similar effect on the response to form predictive clusters represented by a single coefficient. The proposed procedure is shown to compare favorably with existing shrinkage and variable selection techniques in terms of both prediction error and model complexity, while yielding the additional grouping information.
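The penalized least-squares criterion described above can be sketched in a few lines. The fragment below is a minimal NumPy illustration, assuming the OSCAR penalty takes the form λ(Σⱼ|βⱼ| + c Σⱼ<ₖ max{|βⱼ|, |βₖ|}), where the L1 term drives sparsity and the pairwise max term drives exact equality of coefficients; the function names and the unconstrained (Lagrangian) formulation are illustrative, not the authors' implementation, which solves a constrained quadratic program.

```python
import numpy as np
from itertools import combinations

def oscar_penalty(beta, lam=1.0, c=1.0):
    """Illustrative OSCAR-style penalty:

        lam * ( sum_j |beta_j| + c * sum_{j<k} max(|beta_j|, |beta_k|) )

    The L1 part shrinks some coefficients exactly to zero (selection);
    the pairwise max part encourages |beta_j| == |beta_k| for correlated
    predictors (supervised clustering into groups with one coefficient).
    """
    a = np.abs(np.asarray(beta, dtype=float))
    pairwise = sum(max(a[j], a[k]) for j, k in combinations(range(len(a)), 2))
    return lam * (a.sum() + c * pairwise)

def oscar_objective(beta, X, y, lam=1.0, c=1.0):
    """Penalized least-squares objective: 0.5 * ||y - X beta||^2 + penalty."""
    resid = y - X @ beta
    return 0.5 * resid @ resid + oscar_penalty(beta, lam, c)
```

Note that for sorted magnitudes |β|₍₁₎ ≤ … ≤ |β|₍ₚ₎ the pairwise term equals Σⱼ (j−1)|β|₍ⱼ₎, so the penalty is a weighted L1 norm on the ordered coefficients; this is what makes the constraint region octagonal in two dimensions.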

MSC:

62J07 Ridge regression; shrinkage estimators (Lasso)
62P12 Applications of statistics to environmental and related topics
65C60 Computational problems in statistics (MSC2010)
90C90 Applications of mathematical programming
62J05 Linear regression; mixed models

Software:

OSCAR; SQOPT

References:

[1] Dettling, Finding predictive gene groups from microarray data, Journal of Multivariate Analysis 90 pp 106– (2004) · Zbl 1047.62103 · doi:10.1016/j.jmva.2004.02.012
[2] Efron, Least angle regression, Annals of Statistics 32 pp 407– (2004) · Zbl 1091.62054 · doi:10.1214/009053604000000067
[3] Gill, User's guide for SQOPT 7: A Fortran package for large-scale linear and quadratic programming (2005)
[4] Hastie, Supervised harvesting of expression trees, Genome Biology 2 (1) pp 3.1– (2001) · doi:10.1186/gb-2001-2-1-research0003
[5] Jörnsten, Simultaneous gene clustering and subset selection for sample classification via MDL, Bioinformatics 19 pp 1100– (2003) · doi:10.1093/bioinformatics/btg039
[6] Marshall, A multivariate exponential distribution, Journal of the American Statistical Association 62 pp 30– (1967) · Zbl 0147.38106 · doi:10.2307/2282907
[7] Park, Biostatistics 8 pp 212– (2007)
[8] Rosset, Piecewise linear regularized solution paths, Annals of Statistics 35 (2007) · Zbl 1194.62094 · doi:10.1214/009053606000001370
[9] Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B 58 pp 267– (1996) · Zbl 0850.62538
[10] Tibshirani, Sparsity and smoothness via the fused lasso, Journal of the Royal Statistical Society, Series B 67 pp 91– (2005) · Zbl 1060.62049 · doi:10.1111/j.1467-9868.2005.00490.x
[11] Yuan, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society, Series B 68 pp 49– (2006) · Zbl 1141.62030 · doi:10.1111/j.1467-9868.2005.00532.x
[12] Zou, Regularization and variable selection via the elastic net, Journal of the Royal Statistical Society, Series B 67 pp 301– (2005) · Zbl 1069.62054 · doi:10.1111/j.1467-9868.2005.00503.x
[13] Zou, The F-norm support vector machine (2006)
[14] Zou, On the degrees of freedom of the lasso (2004)