
Sparse regression for large data sets with outliers. (English) Zbl 1487.62085

Summary: The linear regression model remains an important workhorse for data scientists. However, many data sets contain many more predictors than observations, and outliers, or anomalies, frequently occur. This paper proposes an algorithm for regression analysis, called “sparse shooting S”, that addresses these features typical of big data sets. The resulting regression coefficients are sparse, meaning that many of them are set to zero, thereby selecting the most relevant predictors. A distinctive feature of the method is its robustness with respect to outliers in the cells of the data matrix. A simulation study shows the excellent performance of this robust variable selection and prediction method, and a real data application on car fuel consumption demonstrates its usefulness.
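
To illustrate only the sparsity mechanism mentioned in the summary (coefficients being set exactly to zero), here is a minimal sketch of plain lasso coordinate descent, a scheme sometimes called the "shooting" algorithm. It is not the sparse shooting S method of the paper: it uses an ordinary least-squares loss and has no robustness to outlying cells. The names `soft_threshold`, `lasso_shooting`, and `lam`, as well as the toy data, are assumptions made for this example.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; return exactly 0.0 when |z| <= gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_shooting(X, y, lam, n_iter=200):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1.

    X is a list of rows, y a list of responses. This is a non-robust
    illustration (hypothetical helper, not the paper's estimator).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residuals y - X beta (beta starts at zero)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot explain anything
            # Inner product of column j with the partial residual
            # (the residual with predictor j's contribution added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n))
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


# Toy data (made up): three orthogonal predictors, the response depends
# strongly on the first, weakly on the second, and not at all on the third.
X = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1]]
y = [2.1, 1.9, 2.1, 1.9]
beta = lasso_shooting(X, y, lam=1.0)
print(beta)
```

In this orthogonal toy design each update has a closed form: the first coefficient is shrunk to (8 - 1) / 4 = 1.75, while the weak and irrelevant predictors are set exactly to zero. A robust variant would replace the squared-error loss and guard against contaminated cells, which this sketch does not attempt.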

MSC:

62J07 Ridge regression; shrinkage estimators (Lasso)
62F35 Robustness and adaptive procedures (parametric inference)

Software:

robustbase; robustHD; R
