A simulation comparison of imputation methods for quantitative data in the presence of multiple data patterns. (English) Zbl 07192731

Summary: An extensive simulation study compares three nonparametric single-imputation methods in the presence of multiple data patterns. The ultimate goal is to provide practical guidance for users who need to quickly pick the most effective imputation method among the following: (i) Forward Imputation (ForImp), considered in two variants, namely ForImp with principal component analysis (PCA), which alternates PCA and the Nearest-Neighbour Imputation (NNI) method in a forward, sequential procedure, and ForImp with the Mahalanobis distance, which uses the Mahalanobis distance when performing NNI; (ii) the iterative PCA technique, which imputes all missing values simultaneously via PCA; and (iii) the missForest method, which is based on random forests and was developed for mixed-type data. The performance of these methods is compared under several data patterns characterized by different levels of kurtosis or skewness and by different correlation structures.
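
To make the iterative PCA idea mentioned in the summary concrete, the following is a minimal sketch in Python, not the authors' code and not the implementation in the missMDA package. It assumes the plain, unregularized textbook scheme: missing cells start at the column means, a rank-k PCA is fitted by truncated SVD, the missing cells are refilled from the reconstruction, and the two steps alternate until the imputed values stabilize. The function name `iterative_pca_impute` and its parameters (`n_components`, `max_iter`, `tol`) are hypothetical choices made for illustration.

```python
import numpy as np


def iterative_pca_impute(X, n_components=2, max_iter=200, tol=1e-8):
    """Impute NaN entries of X by alternating a rank-k PCA fit and a refill step.

    Illustrative sketch only: unregularized, and it assumes every column
    has at least one observed value.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means computed on the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        mean = filled.mean(axis=0)
        # Rank-k PCA reconstruction via truncated SVD of the centred data.
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        k = n_components
        recon = (U[:, :k] * s[:k]) @ Vt[:k] + mean
        # Only the missing cells are updated; observed values stay untouched.
        new = np.where(missing, recon, X)
        change = np.sum((new - filled) ** 2)
        filled = new
        if change < tol:
            break
    return filled
```

A call such as `iterative_pca_impute(X_with_nans, n_components=1)` returns a completed copy of the data matrix in which all missing values have been filled simultaneously from the same low-rank fit, in contrast to the forward, sequential logic of ForImp.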


62H25 Factor analysis and principal components; correspondence analysis
62-07 Data analysis (statistics) (MSC2010)
62-04 Software, source code, etc. for problems pertaining to statistics
62H99 Multivariate analysis

