# zbMATH — the first resource for mathematics

Evaluation of four multiple imputation methods for handling missing binary outcome data in the presence of an interaction between a dummy and a continuous variable. (English) Zbl 07367194
Summary: Multiple imputation by chained equations (MICE) is the most common method for imputing missing data. Within the MICE algorithm, imputation can be performed using a variety of parametric and nonparametric methods. By default, implementations of MICE build imputation models that include variables as linear terms only, with no interactions, but omitting interaction terms may lead to biased results. Using simulated and real datasets, we investigated whether recursive partitioning creates appropriate variability between imputations and yields unbiased parameter estimates with appropriate confidence intervals.

We compared four multiple imputation (MI) methods on a real and a simulated dataset: predictive mean matching with an interaction term in the imputation model in MICE (MICE-Interaction), classification and regression trees (CART) for specifying the imputation model in MICE (MICE-CART), the implementation of random forest (RF) in MICE (MICE-RF), and the MICE-Stratified method. For the real data, we selected a secondary dataset and devised an experimental design of 40 scenarios ($2 \times 5 \times 4$), which differed by the missing-data mechanism (MAR and MCAR), the rate of simulated missing data (10%, 20%, 30%, 40%, and 50%), and the imputation method (MICE-Interaction, MICE-CART, MICE-RF, and MICE-Stratified). We randomly drew 700 observations with replacement 300 times and then created the missing data. The evaluation was based on raw bias (RB) and five other measures, each averaged over the repetitions. Next, in a simulation study, we generated data 1000 times with a sample size of 700 and created missing data once for each dataset. All scenarios in the simulation study were evaluated with the same criteria as the real data.
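The experimental design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the logistic form of the MAR mechanism, and its `slope` parameter are hypothetical, and raw bias is taken in its usual sense (mean of the estimates minus the true value). The imputation step itself is left out.

```python
import itertools
import math
import random

# Design factors as listed in the summary.
MECHANISMS = ["MCAR", "MAR"]
MISSING_RATES = [0.10, 0.20, 0.30, 0.40, 0.50]
METHODS = ["MICE-Interaction", "MICE-CART", "MICE-RF", "MICE-Stratified"]


def scenario_grid():
    """All mechanism x rate x method combinations (2 x 5 x 4 = 40)."""
    return list(itertools.product(MECHANISMS, MISSING_RATES, METHODS))


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def missing_probabilities(x, rate, mechanism, slope=1.0):
    """Per-observation probability that the outcome is set to missing.

    MCAR: the same probability `rate` for every observation.
    MAR (assumed form): probability increases with the observed covariate x
    through a logistic curve; the intercept is found by bisection so that
    the average probability equals `rate`.
    """
    if mechanism == "MCAR":
        return [rate] * len(x)
    lo, hi = -30.0, 30.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        mean_p = sum(_sigmoid(mid + slope * xi) for xi in x) / len(x)
        if mean_p < rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0
    return [_sigmoid(intercept + slope * xi) for xi in x]


def make_missing(y, x, rate, mechanism, rng):
    """Return a copy of y in which some entries are replaced by None."""
    probs = missing_probabilities(x, rate, mechanism)
    return [None if rng.random() < p else yi for yi, p in zip(y, probs)]


def raw_bias(estimates, true_value):
    """Raw bias: average estimate over repetitions minus the true value."""
    return sum(estimates) / len(estimates) - true_value
```

In a full study, each of the 40 scenarios would call `make_missing` on every resampled or simulated dataset, impute with the chosen method, refit the analysis model, and pass the resulting estimates to `raw_bias`.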
We conclude that, when there is an interaction effect between a dummy and a continuous predictor, recursive partitioning can yield substantial gains for imputation compared to parametric methods, and that the MICE-Interaction method is always more efficient and convenient for preserving interaction effects than the other methods.
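To make the setting concrete, one way to write a logistic analysis model for a binary outcome with a dummy-by-continuous interaction is shown below. The notation ($Y$ for the outcome, $D$ for the dummy, $X$ for the continuous predictor, $\beta_j$ for the coefficients) is assumed here for illustration and is not taken from the paper.

$$\operatorname{logit} P(Y = 1 \mid D, X) = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 D X$$

Under this notation, an imputation model with linear terms only amounts to fixing $\beta_3 = 0$, so the imputed outcomes cannot reflect a slope in $X$ that differs between the two groups defined by $D$.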
##### MSC:
62D10 Missing data
62H30 Classification and discrimination; cluster analysis (statistical aspects)
##### Software:
rpart; MICE; Stata; ice; mi; ElemStatLearn; randomForest