zbMATH — the first resource for mathematics

On stability issues in deriving multivariable regression models. (English) Zbl 1329.62035
Summary: In many areas of science where empirical data are analyzed, a common task is to identify important variables with influence on an outcome. Most often this is done by using a variable selection strategy in the context of a multivariable regression model. Using a study on ozone effects in children (\(n = 496\), 24 covariates), we will discuss aspects relevant for deriving a suitable model. With an emphasis on model stability, we will explore and illustrate differences between predictive models and explanatory models, the key role of stopping criteria, and the value of bootstrap resampling (with and without replacement). Bootstrap resampling will be used to assess variable selection stability, to derive a predictor that incorporates model uncertainty, to check for influential points, and to visualize the variable selection process. For the latter two tasks we adapt and extend recent approaches, such as stability paths, to serve our purposes. Based on earlier experiences and on results from the example, we will argue for simpler models and that predictions are usually very similar, irrespective of the selection method used. Important differences exist for the corresponding variances, and the model uncertainty concept helps to protect against serious underestimation of the variance of a predictor derived data-dependently. Results of stability investigations illustrate severe difficulties in the task of deriving a suitable explanatory model. It seems possible to identify a small number of variables with an important and probably true influence on the outcome, but too often several variables are included whose selection may be a result of chance or may depend on a small number of observations.
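Illustration (not part of the original record): the idea of assessing variable selection stability by resampling can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the paper. The data are simulated, the selection rule is a plain forward selection for a linear model, the stopping criterion is an AIC-type penalty (2) or a BIC-type penalty (log n), and the subsampling fraction 0.632 is only an illustrative default. The function names `simulate`, `ols_rss`, `forward_select`, and `inclusion_frequencies` are hypothetical.

```python
import math
import random


def simulate(n=100, p=5, seed=1):
    """Simulated data (an assumption): only covariates 0 and 1 affect y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [2.0 * r[0] + 0.5 * r[1] + rng.gauss(0, 1) for r in X]
    return X, y


def ols_rss(X, y, cols):
    """Residual sum of squares of an OLS fit on an intercept plus columns `cols`."""
    rows = [[1.0] + [r[j] for j in cols] for r in X]
    p = len(cols) + 1
    # Augmented normal equations [Z'Z | Z'y].
    A = [[sum(r[a] * r[b] for r in rows) for b in range(p)]
         + [sum(r[a] * yi for r, yi in zip(rows, y))] for a in range(p)]
    # Gaussian elimination with partial pivoting.
    for k in range(p):
        piv = max(range(k, p), key=lambda i: abs(A[i][k]))
        A[k], A[piv] = A[piv], A[k]
        for i in range(k + 1, p):
            f = A[i][k] / A[k][k]
            for c in range(k, p + 1):
                A[i][c] -= f * A[k][c]
    # Back substitution for the coefficient vector.
    beta = [0.0] * p
    for k in reversed(range(p)):
        s = sum(A[k][c] * beta[c] for c in range(k + 1, p))
        beta[k] = (A[k][p] - s) / A[k][k]
    return sum((yi - sum(b * z for b, z in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def forward_select(X, y, penalty=2.0):
    """Forward selection; stop when n*log(RSS ratio) no longer exceeds `penalty`."""
    n = len(y)
    selected, remaining = [], list(range(len(X[0])))
    rss_cur = ols_rss(X, y, selected)
    while remaining:
        rss_new, j = min((ols_rss(X, y, selected + [j]), j) for j in remaining)
        if n * math.log(rss_cur / rss_new) <= penalty:
            break  # the penalized criterion would not decrease
        selected.append(j)
        remaining.remove(j)
        rss_cur = rss_new
    return selected


def inclusion_frequencies(X, y, n_boot=100, penalty=2.0,
                          replace=True, frac=0.632, seed=0):
    """Share of resamples in which each covariate is selected."""
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        if replace:   # classical bootstrap: n draws with replacement
            idx = [rng.randrange(n) for _ in range(n)]
        else:         # subsampling: frac*n draws without replacement
            idx = rng.sample(range(n), int(frac * n))
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        for j in forward_select(Xb, yb, penalty):
            counts[j] += 1
    return [c / n_boot for c in counts]


if __name__ == "__main__":
    X, y = simulate()
    print("AIC-type:", inclusion_frequencies(X, y, penalty=2.0))
    print("BIC-type:", inclusion_frequencies(X, y, penalty=math.log(len(y))))
```

Covariates with inclusion frequencies close to 1 are selected stably, while intermediate frequencies suggest that a covariate's selection may depend on chance or on a few observations. Because a larger penalty stops the same forward path earlier, the BIC-type frequencies in this sketch are never higher than the AIC-type ones for the same resamples, which illustrates why the stopping criterion matters.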

MSC:
62-07 Data analysis (statistics) (MSC2010)
62H11 Directional data; spatial statistics
62P10 Applications of statistics to biology and medical sciences; meta analysis
62J05 Linear regression; mixed models
62F40 Bootstrap, jackknife and other resampling methods
Software:
party; rms
References:
[1] Akaike, 2nd International Symposium on Information Theory pp 267– (1973)
[2] Altman, Bootstrap investigation of the stability of a Cox regression model, Statistics in Medicine 8 pp 771– (1989) · doi:10.1002/sim.4780080702
[3] Andersen, Regression with Linear Predictors (2010) · Zbl 1284.62025 · doi:10.1007/978-1-4419-7170-8
[4] Atkinson, Robustness, transformations and two graphical displays for outlying and influential observations in regression, Biometrika 68 pp 13– (1981) · Zbl 0462.62049 · doi:10.1093/biomet/68.1.13
[5] Augustin, The practical utility of incorporating model selection uncertainty into prognostic models for survival data, Statistical Modelling 5 pp 95– (2005) · Zbl 1071.62096 · doi:10.1191/1471082X05st089oa
[6] Babu, Resampling methods for model fitting and model selection, Journal of Biopharmaceutical Statistics 21 pp 1177– (2011) · doi:10.1080/10543406.2011.607749
[7] Binder, Stability analysis of an additive spline model for respiratory health data by using knot removal, Journal of the Royal Statistical Society C 58 pp 577– (2009) · doi:10.1111/j.1467-9876.2009.00668.x
[8] Binder, Adapting prediction error estimates for biased complexity selection in high-dimensional bootstrap samples, Statistical Applications in Genetics and Molecular Biology 7 (2008a) · Zbl 1276.62060 · doi:10.2202/1544-6115.1346
[9] Binder, Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models, BMC Bioinformatics 9 pp 14– (2008b) · Zbl 05326434 · doi:10.1186/1471-2105-9-14
[10] Bland, Statistical methods for assessing agreement between two methods of clinical measurement, Lancet 8 pp 307– (1986) · doi:10.1016/S0140-6736(86)90837-8
[11] Box, Robustness in Statistics pp 201– (1979) · doi:10.1016/B978-0-12-438150-6.50018-2
[12] Breiman, The little bootstrap and other methods for dimensionality selection in regression: X-fixed prediction error, Journal of the American Statistical Association 87 pp 738– (1992) · Zbl 0850.62518 · doi:10.1080/01621459.1992.10475276
[13] Breiman, Bagging predictors, Machine Learning 24 pp 123– (1996) · Zbl 0858.68080 · doi:10.1007/BF00058655
[14] Buchholz, On properties of predictors derived with a two-step bootstrap model averaging approach: a simulation study in the linear regression model, Computational Statistics and Data Analysis 52 pp 2778– (2008) · Zbl 1452.62038 · doi:10.1016/j.csda.2007.10.007
[15] Buckland, Model selection: an integral part of inference, Biometrics 53 pp 603– (1997) · Zbl 0885.62118 · doi:10.2307/2533961
[16] Bühlmann, Boosting with the L2 loss: regression and classification, Journal of the American Statistical Association 98 pp 324– (2003) · Zbl 1041.62029 · doi:10.1198/016214503000125
[17] Burnham, Multimodel inference: understanding AIC and BIC in model selection, Sociological Methods & Research 33 pp 261– (2004) · doi:10.1177/0049124104268644
[18] Chen, The bootstrap and identification of prognostic factors via Cox’s proportional hazards regression model, Statistics in Medicine 4 pp 39– (1985) · doi:10.1002/sim.4780040107
[19] Chernick, Bootstrap Methods. A Guide for Practitioners and Researchers (2008)
[20] Cox, Regression models and life-tables (with discussion), Journal of the Royal Statistical Society Series B 34 pp 187– (1972) · Zbl 0243.62041
[21] Davison, Bootstrap Methods and their Application (1997) · Zbl 0886.62001 · doi:10.1017/CBO9780511802843
[22] Davison, Recent developments in bootstrap methodology, Statistical Science 18 pp 141– (2003) · Zbl 1331.62179 · doi:10.1214/ss/1063994969
[23] Efron, Bootstrap methods: another look at the Jackknife, Annals of Statistics 7 pp 1– (1979) · Zbl 0406.62024 · doi:10.1214/aos/1176344552
[24] Efron, Estimating the error rate of a prediction rule: improvement on cross-validation, Journal of the American Statistical Association 78 pp 316– (1983) · Zbl 0543.62079 · doi:10.1080/01621459.1983.10477973
[25] Efron, Least angle regression, Annals of Statistics 32 pp 407– (2004) · Zbl 1091.62054 · doi:10.1214/009053604000000067
[26] Furnival, Regression by leaps and bounds, Technometrics 16 pp 499– (1974) · Zbl 0294.62079 · doi:10.1080/00401706.1974.10489231
[27] Gifi, Nonlinear Multivariate Analysis (1990) · Zbl 0697.62048
[28] Gong, Computer Science and Statistics: Proceedings of the 14th Symposium on the Interface pp 169– (1982)
[29] Greenland, Modeling and variable selection in epidemiologic analysis, American Journal of Public Health 79 pp 340– (1989) · doi:10.2105/AJPH.79.3.340
[30] Harrell, Regression modelling strategies, with Applications to Linear Models, Logistic Regression, and Survival Analysis (2001) · Zbl 0982.62063
[31] Hastie, Forward stagewise regression and the monotone lasso, Electronic Journal of Statistics 1 pp 1– (2007) · Zbl 1306.62176 · doi:10.1214/07-EJS004
[32] Hoeting, Bayesian model averaging: a tutorial (with discussion), Statistical Science 14 pp 382– (1999) · Zbl 1059.62525
[33] Hothorn, T., Hornik, K., Strobl, C., Zeileis, A., A laboratory for recursive partitioning, R package version 1.0-8 (2013), http://cran.r-project.org/web/packages/party/index.html
[34] Ihorst, Long- and medium-term ozone effects on lung growth including a broad spectrum of exposure, European Respiratory Journal 23 pp 292– (2004) · doi:10.1183/09031936.04.00021704
[35] Janitza, S., Binder, H., Boulesteix, A.-L., Pitfalls of hypothesis tests and model selection on bootstrap samples: causes and consequences in biometrical applications, Technical Report 163, Department of Statistics, LMU Munich (2014), https://epub.ub.uni-muenchen.de/21038/index.html · Zbl 1386.62053
[36] LePage, Exploring the Limits of Bootstrap (1992)
[37] Meinshausen, Stability selection, Journal of the Royal Statistical Society B 72 pp 417– (2010) · doi:10.1111/j.1467-9868.2010.00740.x
[38] Miller, Subset Selection in Regression (2002) · Zbl 1051.62060 · doi:10.1201/9781420035933
[39] Nixon, Parametric modelling of cost data in medical studies, Statistics in Medicine 23 pp 1311– (2004) · doi:10.1002/sim.1744
[40] Park, The Bayesian lasso, Journal of the American Statistical Association 103 pp 681– (2008) · Zbl 1330.62292 · doi:10.1198/016214508000000337
[41] Porzelius, Sparse regression techniques in low-dimensional survival settings, Statistics and Computing 20 pp 151– (2010) · doi:10.1007/s11222-009-9155-6
[42] Rousseeuw, Robust Regression and Outlier Detection (1987) · doi:10.1002/0471725382
[43] Royston, Dichotomizing continuous predictors in multiple regression: a bad idea, Statistics in Medicine 25 pp 127– (2006) · doi:10.1002/sim.2331
[44] Royston, Stability of multivariable fractional polynomial models with selection of variables and transformations: a bootstrap investigation, Statistics in Medicine 22 pp 639– (2003) · doi:10.1002/sim.1310
[45] Royston, Multivariable Model-Building: A Pragmatic Approach to Regression Analysis Based on Fractional Polynomials for Modelling Continuous Variables (2008) · Zbl 1269.62053
[46] Sauerbrei, Europäische Perspektiven der Medizinischen Informatik, Biometrie und Epidemiologie pp 108– (1993)
[47] Sauerbrei, The use of resampling methods to simplify regression models in medical statistics, Applied Statistics 48 pp 313– (1999) · Zbl 0939.62114
[48] Sauerbrei, Stability investigations of multivariable regression models derived for low and high dimensional data, Journal of Biopharmaceutical Statistics 21 pp 1206– (2011) · doi:10.1080/10543406.2011.629890
[49] Sauerbrei, Modelling to extract more information from clinical trials data: on some roles for the bootstrap, Statistics in Medicine 26 pp 4989– (2007) · doi:10.1002/sim.2954
[50] Sauerbrei, Selection of important variables and determination of functional form for continuous predictors in multivariable model-building, Statistics in Medicine 26 pp 5512– (2007) · doi:10.1002/sim.3148
[51] Sauerbrei, A bootstrap resampling procedure for model building: application to the Cox regression model, Statistics in Medicine 11 pp 2093– (1992) · doi:10.1002/sim.4780111607
[52] Schemper, The relative importance of prognostic factors in studies of survival, Statistics in Medicine 12 pp 2377– (1993) · doi:10.1002/sim.4780122413
[53] Shmueli, To explain or to predict?, Statistical Science 25 pp 289– (2010) · Zbl 1329.62045 · doi:10.1214/10-STS330
[54] Strobl, Bias in random forest variable importance measures: illustrations, sources and a solution, BMC Bioinformatics 8 pp 25– (2007) · doi:10.1186/1471-2105-8-25
[55] Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society Series B 58 pp 267– (1996) · Zbl 0850.62538
[56] Tsao, Subsampling method for robust estimation of regression models, Open Journal of Statistics 2 pp 281– (2012) · doi:10.4236/ojs.2012.23034
[57] Tutz, Boosting ridge regression, Computational Statistics and Data Analysis 51 pp 6044– (2007) · Zbl 1330.62294 · doi:10.1016/j.csda.2006.11.041
[58] Van Houwelingen, Shrinkage and penalized likelihood as methods to improve predictive accuracy, Statistica Neerlandica 55 pp 17– (2001) · Zbl 1075.62591 · doi:10.1111/1467-9574.00154
[59] Van Houwelingen, Cross-validation, shrinkage and variable selection in linear regression revisited, Open Journal of Statistics 3 pp 79– (2013) · doi:10.4236/ojs.2013.32011
[60] Westfall, On using the bootstrap for multiple comparisons, Journal of Biopharmaceutical Statistics 21 pp 1187– (2011) · doi:10.1080/10543406.2011.607751
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.