zbMATH — the first resource for mathematics

A comparison of various software tools for dealing with missing data via imputation. (English) Zbl 1431.62007
Summary: In real-life situations, we often encounter data sets containing missing observations. Statistical methods that address missingness have been extensively studied in recent years. One of the more popular approaches involves imputation of the missing values prior to the analysis, thereby rendering the data complete. Imputation encompasses a wide range of techniques that have been developed to make inferences about incomplete data, from very simple strategies (e.g. mean imputation) to more advanced approaches that require estimation, for instance, of posterior distributions using Markov chain Monte Carlo methods. Additional complexity arises when the number of missingness patterns increases and/or when both categorical and continuous random variables are involved. Implementations of routines, procedures, or packages capable of generating imputations for incomplete data are now widely available. We review some of these in the context of a motivating example, as well as in a simulation study, under two missingness mechanisms (missing at random and missing not at random). Thus far, evaluations of existing implementations have frequently centred on the resulting parameter estimates of the prescribed model of interest after imputing the missing data. In some situations, however, interest may well lie in the quality of the imputed values at the level of the individual, an issue that has received relatively little attention. In this paper, we focus on the latter to provide further insight into the performance of the different routines, procedures, and packages in this respect.

MSC:
62-04 Software, source code, etc. for problems pertaining to statistics
62D05 Sampling theory, sample surveys
62D10 Missing data