## When do stepwise algorithms meet subset selection criteria? (English) Zbl 1125.62079

Summary: Recent results on homotopy methods and solution paths demonstrate that certain well-designed greedy algorithms, run over a range of values of the algorithmic parameter, can provide the solution paths of a sequence of convex optimization problems. On the other hand, many existing subset selection criteria in regression (including $$C_p$$, AIC, BIC, MDL, RIC, etc.) involve optimizing an objective function that contains a counting measure. The two optimization problems are formulated as (P1) and (P0) in the present paper; the latter is generally combinatorial and has been proven to be NP-hard.
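In the notation standard in this literature (a sketch of the usual formulations; the paper's exact statements of (P1) and (P0) may differ in scaling or constraints), the two problems for a design matrix $$X$$ and response $$y$$ read
$$\text{(P1)}\quad \min_\beta \tfrac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \qquad\qquad \text{(P0)}\quad \min_\beta \|y - X\beta\|_2^2 + \lambda_0\|\beta\|_0,$$
where $$\|\beta\|_0$$ counts the nonzero entries of $$\beta$$ — the counting measure that underlies $$C_p$$-, AIC- and BIC-type criteria — while the $$\ell_1$$ penalty in (P1) makes the problem convex.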
We study the conditions under which the two optimization problems have common solutions. In these situations a stepwise algorithm can therefore be used to solve the seemingly intractable problem. Our main result is motivated by recent work in sparse representations, while two others emerge from different angles: a direct analysis of sufficiency and necessity, and a condition on the most correlated covariates. An extreme example connected with least angle regression is of independent interest.
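The question of when the two problems agree can be probed numerically. The following is a minimal, hypothetical sketch (numpy-only; the toy data, the penalty weight `lam`, and the two solvers are illustrative assumptions, not the paper's stepwise algorithm): (P0) is solved by exhaustive search over supports with a counting-measure penalty, (P1) by coordinate-descent soft-thresholding, and the selected supports are compared.

```python
import itertools
import numpy as np

def solve_p0(X, y, lam):
    """(P0): min ||y - X b||^2 + lam * ||b||_0, by exhaustive subset search."""
    n, p = X.shape
    best, best_obj = np.zeros(p), np.sum(y ** 2)  # start from the empty model
    for k in range(1, p + 1):
        for S in itertools.combinations(range(p), k):
            cols = list(S)
            b, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            obj = np.sum((y - X[:, cols] @ b) ** 2) + lam * k
            if obj < best_obj:
                best_obj = obj
                best = np.zeros(p)
                best[cols] = b
    return best

def solve_p1(X, y, lam, sweeps=500):
    """(P1): min 0.5 ||y - X b||^2 + lam * ||b||_1, by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)
    for _ in range(sweeps):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]          # partial residual
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]  # soft-threshold
    return beta

# Toy sparse-regression instance (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
beta_true = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(50)

b0 = solve_p0(X, y, lam=2.0)   # counting-measure penalty weight: an assumption
b1 = solve_p1(X, y, lam=2.0)
print(np.nonzero(b0)[0], np.nonzero(np.abs(b1) > 1e-8)[0])
```

When the design is well conditioned and the signal strong, as here, the two supports typically coincide; the paper's results characterize exactly when such agreement is guaranteed rather than incidental.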

### MSC:

- 62J99 Linear inference, regression
- 90C25 Convex programming
- 65C60 Computational problems in statistics (MSC2010)
- 62J07 Ridge regression; shrinkage estimators (Lasso)

### Software:

AS 199; PDCO; ElemStatLearn

### References:

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (B. N. Petrov and F. Csáki, eds.) 267–281. Akadémiai Kiadó, Budapest. · Zbl 0283.62006
[2] Burnham, K. P. and Anderson, D. A. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed. Springer, New York. · Zbl 1005.62007
[3] Chen, S. S. (1995). Basis pursuit. Ph.D. dissertation, Dept. Statistics, Stanford Univ.
[4] Chen, S. S., Donoho, D. L. and Saunders, M. A. (1998). Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20 33–61. Reprinted in SIAM Rev. 43 (2001) 129–159. · Zbl 0919.94002
[5] DeVore, R. A. and Temlyakov, V. N. (1996). Some remarks on greedy algorithms. Adv. Comput. Math. 5 173–187. · Zbl 0857.65016
[6] Donoho, D. L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via $$\ell_1$$ minimization. Proc. Natl. Acad. Sci. USA 100 2197–2202. · Zbl 1064.94011
[7] Donoho, D. L., Elad, M. and Temlyakov, V. N. (2006). Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans. Inform. Theory 52 6–18. · Zbl 1288.94017
[8] Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Trans. Inform. Theory 47 2845–2862. · Zbl 1019.94503
[9] Donoho, D. L. and Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Amer. Statist. Assoc. 90 1200–1224. · Zbl 0869.62024
[10] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32 407–499. · Zbl 1091.62054
[11] Elad, M. and Bruckstein, A. M. (2002). A generalized uncertainty principle and sparse representation in pairs of bases. IEEE Trans. Inform. Theory 48 2558–2567. · Zbl 1062.15001
[12] Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360. · Zbl 1073.62547
[13] Fan, J. and Li, R. (2006). Statistical challenges with high dimensionality: Feature selection in knowledge discovery. In International Congress of Mathematicians 3 595–622. European Math. Soc., Zürich. · Zbl 1117.62137
[14] Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928–961. · Zbl 1092.62031
[15] Foster, D. P. and George, E. I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22 1947–1975. · Zbl 0829.62066
[16] Fuchs, J.-J. (2004). On sparse representations in arbitrary redundant bases. IEEE Trans. Inform. Theory 50 1341–1344. · Zbl 1284.94018
[17] Furnival, G. and Wilson, R. (1974). Regressions by leaps and bounds. Technometrics 16 499–511. · Zbl 0285.05110
[18] Gatu, C. and Kontoghiorghes, E. J. (2006). Branch-and-bound algorithms for computing the best-subset regression models. J. Comput. Graph. Statist. 15 139–156.
[19] George, E. I. (2000). The variable selection problem. J. Amer. Statist. Assoc. 95 1304–1308. · Zbl 1018.62050
[20] Gilmour, S. G. (1996). The interpretation of Mallows's $$C_p$$-statistic. The Statistician 45 49–56.
[21] Golub, G. H. and Van Loan, C. F. (1996). Matrix Computations, 3rd ed. Johns Hopkins Univ. Press, Baltimore. · Zbl 0865.65009
[22] Gribonval, R., Figueras i Ventura, R. M. and Vandergheynst, P. (2005). A simple test to check the optimality of sparse signal approximations. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'05) 5 717–720. IEEE Press, Piscataway, NJ. Available at lts1pc19.epfl.ch/repository/Gribonval2005_1167.pdf; a longer version is available at ftp.irisa.fr/techreports/2004/PI-1661.pdf. · Zbl 1069.42019
[23] Gribonval, R. and Nielsen, M. (2003). Sparse representations in unions of bases. IEEE Trans. Inform. Theory 49 3320–3325. · Zbl 1286.94032
[24] Hastie, T., Rosset, S., Tibshirani, R. and Zhu, J. (2004). The entire regularization path for the support vector machine. J. Mach. Learn. Res. 5 1391–1415. · Zbl 1222.68213
[25] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York. · Zbl 0973.62007
[26] Huo, X. and Ni, X. S. (2005). When do stepwise algorithms meet subset selection criteria? Technical report, Georgia Inst. Technology. Available at www2.isye.gatech.edu/statistics/papers/. · Zbl 1125.62079
[27] Jain, A. and Zongker, D. (1997). Feature selection: Evaluation, application and small sample performance. IEEE Trans. Pattern Analysis and Machine Intelligence 19 153–158.
[28] Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence 97 273–324. · Zbl 0904.68143
[29] Malioutov, D. M., Cetin, M. and Willsky, A. S. (2005). Homotopy continuation for sparse signal representation. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'05) 5 733–736. IEEE Press, Piscataway, NJ.
[30] Mallows, C. L. (1973). Some comments on $$C_P$$. Technometrics 15 661–675. · Zbl 0269.62061
[31] Miller, A. J. (2002). Subset Selection in Regression, 2nd ed. Chapman and Hall/CRC, Boca Raton, FL. · Zbl 1051.62060
[32] Narendra, P. M. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Trans. Comput. 26 917–922. · Zbl 0363.68059
[33] Natarajan, B. K. (1995). Sparse approximate solutions to linear systems. SIAM J. Comput. 24 227–234. · Zbl 0827.68054
[34] Ni, X. S. (2006). New results in detection, estimation and model selection. Ph.D. dissertation, Georgia Inst. Technology. Available at etd.gatech.edu/.
[35] Ni, X. S. and Huo, X. (2005). Enhanced leaps-and-bounds methods in subset selections with additional optimality tests. INFORMS QSR student paper competition finalist. Available at qsr.section.informs.org/qsr_activities.htm.
[36] Ni, X. S. and Huo, X. (2006). Regressions by enhanced leaps-and-bounds via optimality tests (LBOT). Technical report, Georgia Inst. Technology. Available at www2.isye.gatech.edu/statistics/papers/.
[37] Osborne, M. R. (1985). Finite Algorithms in Optimization and Data Analysis. Wiley, Chichester. · Zbl 0573.65044
[38] Osborne, M. R., Presnell, B. and Turlach, B. A. (2000). On the LASSO and its dual. J. Comput. Graph. Statist. 9 319–337.
[39] Osborne, M. R., Presnell, B. and Turlach, B. A. (2000). A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20 389–403. · Zbl 0962.65036
[40] Ridout, M. S. (1988). Algorithm AS 233: An improved branch and bound algorithm for feature subset selection. Appl. Statist. 37 139–147.
[41] Roberts, S. J. (1984). Algorithm AS 199: A branch and bound algorithm for determining the optimal feature subset of given size. Appl. Statist. 33 236–241.
[42] Rockafellar, R. T. (1970). Convex Analysis. Princeton Univ. Press. · Zbl 0193.18401
[43] Santosa, F. and Symes, W. W. (1986). Linear inversion of band-limited reflection seismograms. SIAM J. Sci. Statist. Comput. 7 1307–1330. · Zbl 0602.73113
[44] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464. · Zbl 0379.62005
[45] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. · Zbl 0850.62538
[46] Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Inform. Theory 50 2231–2242. · Zbl 1288.94019
[47] Tropp, J. A. (2006). Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Trans. Inform. Theory 52 1030–1051. · Zbl 1288.94025
[48] Turlach, B. A. (2004). Discussion of "Least angle regression," by B. Efron, T. Hastie, I. Johnstone and R. Tibshirani. Ann. Statist. 32 481–490. · Zbl 1091.62054
[49] Wu, C.-F. J. (1993). Construction of supersaturated designs through partially aliased interactions. Biometrika 80 661–669. · Zbl 0800.62483
[50] Zou, H. (2005). The adaptive Lasso and its oracle properties. Unpublished manuscript. · Zbl 1171.62326
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.