# zbMATH — the first resource for mathematics

Recursive partitioning on incomplete data using surrogate decisions and multiple imputation. (English) Zbl 1243.62092
Summary: Missing data are a major problem in statistical data analysis, affecting all scientific fields and data of every kind and size. A number of ad hoc solutions exist, but they unfortunately lead to a loss of power, biased inference, underestimation of variability and distorted relationships between variables. A more promising and increasingly popular approach is multiple imputation by chained equations (MICE), also known as imputation by full conditional specification (FCS). Alternatives to imputation are methods with built-in procedures for handling missing values. These include recursive partitioning by classification and regression trees as well as the corresponding random forests. However, there is little literature comparing the two approaches, and existing evaluations often lack generalizability because of restrictions on data structure and simulation schemes. Applying both methods to several kinds of data and different simulation settings is meant to improve and extend these comparative analyses. Both classification and regression problems are examined. Recursive partitioning is carried out with two popular tree implementations and one random forest implementation. The findings show that multiple imputation produces ambiguous performance results for both simulated and real-life data. Using surrogates instead is a fast and simple way to achieve performance that is only negligibly worse and in many cases even superior.
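
The chained-equations idea named in the summary can be illustrated with a minimal Python sketch. It is not the algorithm of the MICE software listed below and not the authors' procedure: the function name `chained_imputation`, the linear-regression imputation model, the toy data and the choice of five imputations are all assumptions made for illustration, and the sketch omits the draw of regression parameters that a proper multiple-imputation scheme would include. Each column with missing cells is regressed in turn on all other columns, and its missing cells are replaced by the prediction plus random noise.

```python
import numpy as np

def chained_imputation(X, n_iter=10, rng=None):
    """Return one completed copy of X (NaN marks a missing cell).

    Hypothetical helper: assumes numeric columns, each with at least
    one observed value, and a linear model for every incomplete column.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Start by filling each gap with a random observed value of its column.
    for j in range(X.shape[1]):
        m = missing[:, j]
        if m.any():
            filled[m, j] = rng.choice(X[~m, j], size=m.sum())
    # Cycle through the incomplete columns, re-imputing each from the others.
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            m = missing[:, j]
            if not m.any():
                continue
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(filled)), others])
            # Fit on rows where column j was actually observed.
            beta, *_ = np.linalg.lstsq(A[~m], filled[~m, j], rcond=None)
            sigma = (filled[~m, j] - A[~m] @ beta).std()
            # Prediction plus residual noise keeps the imputations variable.
            filled[m, j] = A[m] @ beta + rng.normal(0.0, sigma, size=m.sum())
    return filled

# Toy data (assumed): two correlated columns, about 30% of the second missing.
gen = np.random.default_rng(0)
x1 = gen.normal(size=200)
x2 = 2 * x1 + gen.normal(scale=0.5, size=200)
data = np.column_stack([x1, x2])
data[gen.random(200) < 0.3, 1] = np.nan

# Multiple imputation: several completed data sets, then a pooled estimate.
completed = [chained_imputation(data, rng=seed) for seed in range(5)]
pooled_mean = np.mean([c[:, 1].mean() for c in completed])
```

Averaging an estimate over the completed data sets, as done for `pooled_mean`, mirrors the pooling step of multiple imputation; pooling variances would additionally require the between-imputation spread.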

##### MSC:
- 62H30 Classification and discrimination; cluster analysis (statistical aspects)
- 62H99 Multivariate analysis
- 65C60 Computational problems in statistics (MSC2010)
##### Software:
R; MICE; UCI-ml; party; rpart; C4.5; randomForest