## Sparse least trimmed squares regression for analyzing high-dimensional large data sets. (English) Zbl 1454.62123

Summary: Sparse model estimation is a topic of high importance in modern data analysis due to the increasing availability of data sets with a large number of variables. Another common problem in applied statistics is the presence of outliers in the data. This paper combines robust regression and sparse model estimation. A robust and sparse estimator is introduced by adding an $$L_{1}$$ penalty on the coefficient estimates to the well-known least trimmed squares (LTS) estimator. The breakdown point of this sparse LTS estimator is derived, and a fast algorithm for its computation is proposed. In addition, the sparse LTS is applied to protein and gene expression data of the NCI-60 cancer cell panel. Both a simulation study and the real data application show that the sparse LTS has better prediction performance than its competitors in the presence of leverage points.
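The construction described in the summary, an $$L_{1}$$ penalty added to the LTS objective, can be illustrated with a minimal pure-Python sketch. It is not the paper's algorithm and not the robustHD API. Several assumptions are made here: the penalty is scaled as $$h\lambda$$, where $$h$$ is the number of untrimmed observations; no intercept is fitted; the lasso subproblem is solved by a simple coordinate descent; and a single hand-picked starting subset is used rather than many random starts. All function names (`lasso_cd`, `sparse_lts_objective`, `c_steps`) and the toy data are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t (the lasso proximal step)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Minimize sum_i (y_i - x_i'b)^2 + n*lam*||b||_1 by coordinate descent.

    No intercept is fitted (an assumption of this sketch); X is a list of rows.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            norm2 = sum(c * c for c in col)
            if norm2 == 0.0:
                continue
            # Inner product of column j with the residual that ignores beta[j].
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid))
            new = soft_threshold(rho, n * lam / 2.0) / norm2
            delta = new - beta[j]
            resid = [r - c * delta for c, r in zip(col, resid)]
            beta[j] = new
    return beta


def squared_residuals(X, y, beta):
    """Squared residuals of all observations for a given coefficient vector."""
    return [(yi - sum(b * x for b, x in zip(beta, row))) ** 2
            for row, yi in zip(X, y)]


def sparse_lts_objective(X, y, beta, h, lam):
    """Sum of the h smallest squared residuals plus h*lam*||beta||_1."""
    r2 = sorted(squared_residuals(X, y, beta))
    return sum(r2[:h]) + h * lam * sum(abs(b) for b in beta)


def c_steps(X, y, h, lam, subset, max_steps=50):
    """Alternate a lasso fit on the current subset with re-selection of the
    h observations that have the smallest squared residuals."""
    history = []
    beta = None
    for _ in range(max_steps):
        beta = lasso_cd([X[i] for i in subset], [y[i] for i in subset], lam)
        history.append(sparse_lts_objective(X, y, beta, h, lam))
        r2 = squared_residuals(X, y, beta)
        new_subset = sorted(sorted(range(len(y)), key=lambda i: r2[i])[:h])
        if new_subset == subset:
            break
        subset = new_subset
    return beta, subset, history


# Hypothetical toy data: y = 3 * x1, an uninformative second column,
# and four contaminated responses at the first four observations.
n, h, lam = 20, 15, 0.1
X = [[(i + 1) / 10.0, 0.5 * (-1) ** i] for i in range(n)]
y = [10.0 if i < 4 else 3.0 * X[i][0] for i in range(n)]

# Start deliberately from a subset that contains all four outliers.
beta, subset, history = c_steps(X, y, h, lam, subset=list(range(h)))
print(beta, subset, history)
```

On this toy data the four contaminated observations are dropped from the final subset and the coefficient of the uninformative column is exactly zero. Each re-selection step cannot increase the objective, since the lasso fit is optimal for the current subset and the new subset is optimal for the current fit.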

### MSC:

- 62G08 Nonparametric regression and quantile regression
- 62G35 Nonparametric robustness
- 62-08 Computational methods for problems pertaining to statistics
- 62P10 Applications of statistics to biology and medical sciences; meta analysis

### Software:

lars; R; robustHD; quantreg; robustbase; simFrame
