Predicting missing values: a comparative study on non-parametric approaches for imputation. (English) Zbl 07148717

Summary: Missing data is an expected issue when large amounts of data are collected, and several imputation techniques have been proposed to tackle this problem. Besides classical approaches such as MICE, the application of Machine Learning techniques is tempting. Here, the recently proposed missForest imputation method has shown high imputation accuracy under the Missing (Completely) at Random scheme with various missing rates. At its core, it is based on random forests, used for classification and regression, respectively. In this paper we study whether this approach can be enhanced further by other methods such as the stochastic gradient tree boosting method, the C5.0 algorithm, BART or modified random forest procedures. In particular, other resampling strategies within the random forest protocol are suggested. In an extensive simulation study, we analyze their performance for continuous, categorical as well as mixed-type data. An empirical analysis focusing on credit information and Facebook data complements our investigations.
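
The evaluation idea behind the summary (delete values completely at random at several missing rates, impute them, and score the imputations against the known truth) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's simulation design: the function names `mask_mcar`, `impute_column_means` and `rmse_on_missing` are hypothetical, column-mean imputation is only a stand-in baseline for missForest or its competitors, the data are synthetic, and RMSE is an assumed accuracy measure.

```python
import math
import random


def mask_mcar(rows, rate, rng):
    """Return a copy of rows with each cell set to None with probability `rate`.

    Every cell is deleted independently of all data values, which is the
    Missing Completely at Random (MCAR) mechanism.
    """
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_column_means(rows):
    """Fill each None with the mean of the observed values in its column.

    Assumption: a simple stand-in baseline, NOT missForest or any method
    studied in the paper.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column happens to be entirely missing.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def rmse_on_missing(truth, masked, imputed):
    """Root mean squared error computed over the masked cells only."""
    squared_errors = [
        (imputed[i][j] - truth[i][j]) ** 2
        for i, row in enumerate(masked)
        for j, v in enumerate(row)
        if v is None
    ]
    if not squared_errors:
        return 0.0
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic continuous data (assumption: two independent Gaussian columns).
rng = random.Random(0)
truth = [[rng.gauss(0.0, 1.0), rng.gauss(5.0, 2.0)] for _ in range(200)]

# Repeat the mask / impute / score cycle for several missing rates.
for rate in (0.1, 0.2, 0.3):
    masked = mask_mcar(truth, rate, rng)
    imputed = impute_column_means(masked)
    print(rate, round(rmse_on_missing(truth, masked, imputed), 3))
```

Replacing `impute_column_means` with another imputer that has the same input and output shape would let different methods be compared on identical masks; categorical columns would need a different error measure, such as a misclassification rate.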


65C60 Computational problems in statistics (MSC2010)


R; C4.5; gbm; BartPy; missForest; C50; MICE

