Liu, Lin; Qiu, Yuqi; Natarajan, Loki; Messer, Karen
Imputation and post-selection inference in models with missing data: an application to colorectal cancer surveillance guidelines. (English) Zbl 1434.62226
Ann. Appl. Stat. 13, No. 3, 1370-1396 (2019).

Summary: It is common to encounter missing data among the potential predictor variables in the setting of model selection. For example, in a recent study we attempted to improve the US guidelines for risk stratification after screening colonoscopy [the first author et al., “A prognostic model for advanced colorectal neoplasia recurrence”, Cancer Causes Control 27, 1175–1185 (2016; doi:10.1007/s10552-016-0795-5)], with the aim to help reduce both overuse and underuse of follow-on surveillance colonoscopy. The goal was to incorporate selected additional informative variables into a neoplasia risk-prediction model, going beyond the three currently established risk factors, using a large dataset pooled from seven different prospective studies in North America. Unfortunately, not all candidate variables were collected in all studies, so that one or more important potential predictors were missing on over half of the subjects. Thus, while variable selection was a main focus of the study, it was necessary to address the substantial amount of missing data. Multiple imputation can effectively address missing data, and there are also good approaches to incorporate the variable selection process into model-based confidence intervals. However, there is not consensus on appropriate methods of inference which address both issues simultaneously. Our goal here is to study the properties of model-based confidence intervals in the setting of imputation for missing data followed by variable selection. We use both simulation and theory to compare three approaches to such post-imputation-selection inference: a multiple-imputation approach based on Rubin’s Rules for variance estimation [M. Schomaker and C. Heumann, Comput. Stat.
Data Anal. 71, 758–770 (2014; Zbl 1471.62181)]; a single imputation-selection followed by bootstrap percentile confidence intervals; and a new bootstrap model-averaging approach presented here, following [B. Efron, J. Am. Stat. Assoc. 109, No. 507, 991–1007 (2014; Zbl 1368.62071)]. We investigate relative strengths and weaknesses of each method. The “Rubin’s Rules” multiple imputation estimator can have severe undercoverage, and is not recommended. The imputation-selection estimator with bootstrap percentile confidence intervals works well. The bootstrap-model-averaged estimator, with the “Efron’s Rules” estimated variance, may be preferred if the true effect sizes are moderate. We apply these results to the colorectal neoplasia risk-prediction problem which motivated the present work.

Cited in 1 Document

MSC:
62P10 Applications of statistics to biology and medical sciences; meta analysis
62D10 Missing data
62H12 Estimation in multivariate analysis
62J15 Paired and multiple comparisons; multiple testing

Keywords: post-selection inference; missing data; multiple imputation; model selection; model averaging; Efron’s Rules
Citations: Zbl 1368.62071; Zbl 1471.62181
Software: MICE; MAMI

References:
[1] Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. M. (2006). Measurement Error in Nonlinear Models: A Modern Perspective, 2nd ed. Monographs on Statistics and Applied Probability 105. Chapman & Hall/CRC, Boca Raton, FL. · Zbl 1119.62063
[2] Chatterjee, A. and Lahiri, S. N. (2010). Asymptotic properties of the residual bootstrap for Lasso estimators. Proc. Amer. Math. Soc. 138 4497-4509. · Zbl 1203.62014 · doi:10.1090/S0002-9939-2010-10474-4
[3] Chatterjee, A. and Lahiri, S. N. (2011). Bootstrapping lasso estimators. J. Amer. Statist. Assoc. 106 608-625. · Zbl 1232.62088 · doi:10.1198/jasa.2011.tm10159
[4] Claeskens, G. (2016). Statistical model choice. Ann. Rev. Stat. Appl. 3 233-256.
[5] Claeskens, G.
and Consentino, F. (2008). Variable selection with incomplete covariate data. Biometrics 64 1062-1069. · Zbl 1152.62388 · doi:10.1111/j.1541-0420.2008.01003.x
[6] Claeskens, G. and Hjort, N. L. (2003). The focused information criterion. J. Amer. Statist. Assoc. 98 900-945. · Zbl 1045.62003 · doi:10.1198/016214503000000819
[7] Claeskens, G. and Hjort, N. L. (2008a). Minimizing average risk in regression models. Econometric Theory 24 493-527. · Zbl 1284.62454 · doi:10.1017/S0266466608080201
[8] Claeskens, G. and Hjort, N. L. (2008b). Model Selection and Model Averaging. Cambridge Series in Statistical and Probabilistic Mathematics 27. Cambridge Univ. Press, Cambridge. · Zbl 1166.62001
[9] Efron, B. (2014). Estimation and accuracy after model selection. J. Amer. Statist. Assoc. 109 991-1007. · Zbl 1368.62071 · doi:10.1080/01621459.2013.823775
[10] Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc. 102 359-378. · Zbl 1284.62093 · doi:10.1198/016214506000001437
[11] Heymans, M. W., van Buuren, S., Knol, D. L., van Mechelen, W. and de Vet, H. C. W. (2007). Variable selection under multiple imputation using the bootstrap in a prognostic study. BMC Med. Res. Methodol. 7.
[12] Hjort, N. L. (2014). Comment [MR3265671]. J. Amer. Statist. Assoc. 109 1017-1020. · Zbl 1368.62075 · doi:10.1080/01621459.2014.923315
[13] Hjort, N. L. and Claeskens, G. (2003). Frequentist model average estimators. J. Amer. Statist. Assoc. 98 879-899. · Zbl 1047.62003 · doi:10.1198/016214503000000828
[14] Hosmer, D. W. and Lemeshow, S. (1989). Applied Logistic Regression. Wiley-Interscience, New York. · Zbl 0967.62045
[15] Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. J. Amer. Statist. Assoc. 91 222-230. · Zbl 0870.62053 · doi:10.1080/01621459.1996.10476680
[16] Lachenbruch, P. A. (2011). Variable selection when missing values are present: A case study.
Stat. Methods Med. Res. 20 429-444. · Zbl 1414.62453 · doi:10.1177/0962280209358003
[17] Lieberman, D. A., Rex, D. K., Winawer, S. J., Giardiello, F. M., Johnson, D. A. and Levin, T. R. (2012). Guidelines for colonoscopy surveillance after screening and polypectomy: A consensus update by the US multi-society task force on colorectal cancer. Gastroenterology 143 844-857.
[18] Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd ed. Wiley Series in Probability and Statistics. Wiley-Interscience, Hoboken, NJ. · Zbl 1011.62004
[19] Liu, L., Messer, K., Baron, J. A., Lieberman, D. A., Jacobs, E. T., Cross, A. J., Murphy, G., Martinez, M. E. and Gupta, S. (2016). A prognostic model for advanced colorectal neoplasia recurrence. Cancer Causes Control 27 1175-1185. · doi:10.1007/s10552-016-0795-5
[20] Liu, L., Qiu, Y., Natarajan, L. and Messer, K. (2019). Supplement to “Imputation and post-selection inference in models with missing data: An application to colorectal cancer surveillance guidelines.” · Zbl 1434.62226 · doi:10.1214/19-AOAS1239SUPP
[21] Long, Q. and Johnson, B. A. (2015). Variable selection in the presence of missing data: Resampling and imputation. Biostatistics 16 596-610.
[22] Martinez, M. E., Thompson, P., Messer, K. et al. (2012). One-year risk of advanced colorectal neoplasia: United States vs. United Kingdom risk-stratification guidelines. Ann. Intern. Med. 157 856-864.
[23] Meinshausen, N. and Bühlmann, P. (2010). Stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol. 72 417-473. · Zbl 1411.62142
[24] Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics. Wiley, New York. · Zbl 1070.62007
[25] Schomaker, M. and Heumann, C. (2014). Model selection and model averaging after multiple imputation. Comput. Statist. Data Anal. 71 758-770. · Zbl 1471.62181 · doi:10.1016/j.csda.2013.02.017
[26] Schomaker, M.
and Heumann, C. (2018). Bootstrap inference when using multiple imputation. Stat. Med. 37 2252-2266.
[27] Siegel, R. L., Miller, K. D. and Jemal, A. (2015). Cancer statistics. CA Cancer J. Clin. 65 5-29.
[28] Tanner, M. A. and Wong, W. H. (1987). An application of imputation to an estimation problem in grouped lifetime analysis. Technometrics 29 23-32.
[29] Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer, New York. · Zbl 1105.62002
[30] van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 45 1-67.
[31] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics 3. Cambridge Univ. Press, Cambridge. · Zbl 0910.62001
[32] Wood, A. M., White, I. R. and Royston, P. (2008). How should variable selection be performed with multiply imputed data? Stat. Med. 27 3227-3246.