×

Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates. (English) Zbl 1384.62139

Summary: We present a new algorithm for boosting generalized additive models for location, scale and shape (GAMLSS) that allows to incorporate stability selection, an increasingly popular way to obtain stable sets of covariates while controlling the per-family error rate. The model is fitted repeatedly to subsampled data, and variables with high selection frequencies are extracted. To apply stability selection to boosted GAMLSS, we develop a new “noncyclical” fitting algorithm that incorporates an additional selection step of the best-fitting distribution parameter in each iteration. This new algorithm has the additional advantage that optimizing the tuning parameters of boosting is reduced from a multi-dimensional to a one-dimensional problem with vastly decreased complexity. The performance of the novel algorithm is evaluated in an extensive simulation study. We apply this new algorithm to a study to estimate abundance of common eider in Massachusetts, USA, featuring excess zeros, overdispersion, nonlinearity and spatiotemporal structures. Eider abundance is estimated via boosted GAMLSS, allowing both mean and overdispersion to be regressed on covariates. Stability selection is used to obtain a sparse set of stable predictors.

MSC:

62G08 Nonparametric regression and quantile regression
65C60 Computational problems in statistics (MSC2010)
68T05 Learning and adaptive systems in artificial intelligence
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] Aho, K., Derryberry, D.W., Peterson, T.: Model selection for ecologists: the worldviews of AIC and BIC. Ecology 95, 631-636 (2014) · doi:10.1890/13-1452.1
[2] Anderson, D.R., Burnham, K.P.: Avoiding pitfalls when using information-theoretic methods. J. Wildl. Manag. 912-918 (2002)
[3] Bühlmann, P., Hothorn, T.: Boosting algorithms: regularization, prediction and model fitting. Stat. Sci. 22, 477-505 (2007) · Zbl 1246.62163 · doi:10.1214/07-STS242
[4] Bühlmann, P., Hothorn, T.: Twin boosting: improved feature selection and prediction. Stat. Comput. 20, 119-138 (2010) · doi:10.1007/s11222-009-9148-5
[5] Bühlmann, P., Yu, B.: Boosting with the L \[_22\] loss: regression and classification. J. Am. Stat. Assoc. 98, 324-339 (2003) · Zbl 1041.62029 · doi:10.1198/016214503000125
[6] Bühlmann, P., Yu, B.: Sparse boosting. J. Mach. Learn. Res. 7, 1001-1024 (2006) · Zbl 1222.68155
[7] Bühlmann, P., Gertheiss, J., Hieke, S., Kneib, T., Ma, S., Schumacher, M., Tutz, G., Wang, C., Wang, Z., Ziegler, A., et al.: Discussion of “the evolution of boosting algorithms” and “extending statistical boosting”. Methods Inf. Med. 53(6), 436-445 (2014) · doi:10.3414/13100122
[8] Dormann, C.F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carre, G., Marquez, J.R.G., Gruber, B., Lafourcade, B., Leitao, P.J., Münkemüller, T., McClean, C., Osborne, P.E., Reineking, B., Schröder, B., Skidmore, A.K., Zurell, D., Lautenbach, S.: Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography 36, 27-46 (2013) · doi:10.1111/j.1600-0587.2012.07348.x
[9] Flack, V.F., Chang, P.C.: Frequency of selecting noise variables in subset regression analysis: a simulation study. Am. Stat. 41(1), 84-86 (1987)
[10] Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Stat. 28(2), 337-407 (2000) · Zbl 1106.62323 · doi:10.1214/aos/1016218223
[11] Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models, vol. 43. CRC Press, Boca Raton (1990) · Zbl 0747.62061
[12] Hofner, B., Boccuto, L., Göker, M.: Controlling false discoveries in high-dimensional situations: boosting with stability selection. BMC Bioinf. 16(1), 144 (2015) · doi:10.1186/s12859-015-0575-3
[13] Hofner, B., Hothorn, T.: stabs: stability selection with error control (2017). http://CRAN.R-project.org/package=stabs. R package version 0.6-2
[14] Hofner, B., Hothorn, T., Kneib, T., Schmid, M.: A framework for unbiased model selection based on boosting. J. Comput. Gr. Stat. 20, 956-971 (2011) · doi:10.1198/jcgs.2011.09220
[15] Hofner, B., Mayr, A., Fenske, N., Thomas, J., Schmid, M.: gamboostLSS: boosting methods for GAMLSS models (2017). http://CRAN.R-project.org/package=gamboostLSS. R package version 2.0-0
[16] Hofner, B., Mayr, A., Robinzonov, N., Schmid, M.: Model-based boosting in R—A hands-on tutorial using the R package mboost. Comput. Stat. 29, 3-35 (2014) · Zbl 1306.65069 · doi:10.1007/s00180-012-0382-5
[17] Hofner, B., Mayr, A., Schmid, M.: gamboostLSS: an R package for model building and variable selection in the GAMLSS framework. J. Stat. Softw. 74(1), 1-31 (2016) · doi:10.18637/jss.v074.i01
[18] Hothorn, T., Buehlmann, P., Kneib, T., Schmid, M., Hofner, B.: Model-based boosting 2.0. J. Mach. Learn. Res. 11, 2109-2113 (2010) · Zbl 1242.68002
[19] Hothorn, T., Buehlmann, P., Kneib, T., Schmid, T., Hofner, B.: mboost: model-based boosting (2017). http://CRAN.R-project.org/package=mboost. R package version 2.8-0 · Zbl 1490.62201
[20] Hothorn, T., Müller, J., Schröder, B., Kneib, T., Brandl, R.: Decomposing environmental, spatial, and spatiotemporal components of species distributions. Ecol. Monogr. 81, 329-347 (2011) · doi:10.1890/10-0602.1
[21] Huang, S.M.Y., Huang, J., Fang, K.: Gene network-based cancer prognosis analysis with sparse boosting. Genet. Res. 94, 205-221 (2012) · doi:10.1017/S0016672312000419
[22] Li, P.: Robust logitboost and adaptive base class (abc) logitboost (2012). arXiv preprint arXiv:1203.3491
[23] Mayr, A., Binder, H., Gefeller, O., Schmid, M., et al.: The evolution of boosting algorithms. Methods Inf. Med. 53(6), 419-427 (2014) · doi:10.3414/ME13-01-0122
[24] Mayr, A., Binder, H., Gefeller, O., Schmid, M., et al.: Extending statistical boosting. Methods Inf. Med. 53(6), 428-435 (2014) · Zbl 1242.68002
[25] Mayr, A., Fenske, N., Hofner, B., Kneib, T., Schmid, M.: Generalized additive models for location, scale and shape for high-dimensional data—a flexible approach based on boosting. J. R. Stat. Soc. Ser. C Appl. Stat. 61(3), 403-427 (2012) · doi:10.1111/j.1467-9876.2011.01033.x
[26] Mayr, A., Hofner, B., Schmid, M.: Boosting the discriminatory power of sparse survival models via optimization of the concordance index and stability selection. BMC Bioinf. 17(1), 288 (2016)
[27] Mayr, A., Hofner, B., Schmid, M., et al.: The importance of knowing when to stop. Methods Inf. Med. 51(2), 178-186 (2012) · doi:10.3414/ME11-02-0030
[28] Meinshausen, N., Bühlmann, P.: Stability selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 72(4), 417-473 (2010) · Zbl 1411.62142 · doi:10.1111/j.1467-9868.2010.00740.x
[29] Messner, J.W., Mayr, G.J., Zeileis, A.: Nonhomogeneous boosting for predictor selection in ensemble postprocessing. Mon. Weather Rev. 145(1), 137-147 (2017). doi:10.1175/MWR-D-16-0088.1 · doi:10.1175/MWR-D-16-0088.1
[30] Mullahy, J.: Specification and testing of some modified count data models. J. Econom. 33(3), 341-365 (1986) · doi:10.1016/0304-4076(86)90002-3
[31] Murtaugh, P.A.: Performance of several variable-selection methods applied to real ecological data. Ecol. Lett. 12, 1061-1068 (2009) · doi:10.1111/j.1461-0248.2009.01361.x
[32] Opelt, A., Fussenegger, M., Pinz, A., Auer, P.: Weak hypotheses and boosting for generic object detection and recognition. In: European Conference on Computer Vision, pp. 71-84. Springer (2004) · Zbl 1098.68834
[33] Osorio, J.D.G., Galiano, S.G.G.: Non-stationary analysis of dry spells in monsoon season of Senegal River Basin using data from regional climate models (RCMs). J. Hydrol. 450-451, 82-92 (2012) · doi:10.1016/j.jhydrol.2012.05.029
[34] Rigby, R.A., Stasinopoulos, D.M.: Generalized additive models for location, scale and shape. J. R. Stat. Soc. Ser. C (Appl. Stat.) 54(3), 507-554 (2005) · Zbl 1490.62201 · doi:10.1111/j.1467-9876.2005.00510.x
[35] Rigby, R.A., Stasinopoulos, D.M., Akantziliotou, C.: Instructions on how to use the gamlss package in R (2008). http://www.gamlss.org/wp-content/uploads/2013/01/gamlss-manual.pdf · Zbl 1231.62019
[36] Schmid, M., Hothorn, T.: Boosting additive models using component-wise P-splines. Comput. Stat. Data Anal. 53(2), 298-311 (2008) · Zbl 1231.62071 · doi:10.1016/j.csda.2008.09.009
[37] Schmid, M., Potapov, S., Pfahlberg, A., Hothorn, T.: Estimation and regularization techniques for regression models with multidimensional prediction functions. Stat. Comput. 20(2), 139-150 (2010) · doi:10.1007/s11222-009-9162-7
[38] Shah, R.D., Samworth, R.J.: Variable selection with error control: Another look at stability selection. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 75(1), 55-80 (2013) · Zbl 07555438 · doi:10.1111/j.1467-9868.2011.01034.x
[39] Smith, A.D., Hofner, B., Osenkowski, J.E., Allison, T., Sadoti, G., McWilliams, S.R., Paton, P.W.C.: Spatiotemporal modelling of sea duck abundance: implications for marine spatial planning (2017). arXiv preprint arXiv:1705.00644 · Zbl 1041.62029
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.