An imputation method for categorical variables with application to nonlinear principal component analysis. (English) Zbl 1328.65028

Summary: Missing data is a delicate and often underrated problem in building multidimensional composite indicators. An imputation method particularly suitable for categorical data is proposed. The method is discussed in detail in the framework of nonlinear principal component analysis and compared with other missing data treatments commonly used in this analysis. Its performance relative to these treatments is evaluated through a simulation procedure applied both to an artificial case, under varying experimental conditions, and to a real case. The proposed procedure is implemented in R.
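The kind of simulation-based evaluation mentioned in the summary can be illustrated with a minimal sketch. This is not the paper's method or its R implementation: it assumes a generic baseline (column-wise mode imputation), values deleted completely at random, and a simple recovery-rate score. The function names `mask_at_random`, `impute_mode` and `recovery_rate`, the artificial data, and the missingness rates are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def mask_at_random(rows, rate, rng):
    """Copy rows, setting each cell to None with probability `rate`."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_mode(rows):
    """Replace each None with the most frequent observed category of its column.

    Assumption: a generic baseline treatment, not the method of the paper.
    """
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # A column with no observed value at all stays missing.
        modes.append(Counter(observed).most_common(1)[0][0] if observed else None)
    return [[modes[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def recovery_rate(complete, masked, imputed):
    """Share of masked cells whose imputed category equals the true one."""
    hits = total = 0
    for true_row, masked_row, imputed_row in zip(complete, masked, imputed):
        for t, m, i in zip(true_row, masked_row, imputed_row):
            if m is None:
                total += 1
                hits += (t == i)
    return hits / total if total else float("nan")


rng = random.Random(0)
# Hypothetical artificial case: 200 units, 4 ordinal items on a 1..5 scale.
complete = [[rng.choice([1, 2, 3, 3, 4, 5]) for _ in range(4)]
            for _ in range(200)]

# Vary one experimental condition: the proportion of missing cells.
for rate in (0.05, 0.10, 0.20):
    masked = mask_at_random(complete, rate, rng)
    imputed = impute_mode(masked)
    print(rate, round(recovery_rate(complete, masked, imputed), 3))
```

Starting from complete data makes the true categories known, so any imputation treatment can be scored on exactly the cells that were deleted; swapping `impute_mode` for another treatment leaves the rest of the loop unchanged.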


65C60 Computational problems in statistics (MSC2010)
62H25 Factor analysis and principal components; correspondence analysis


Software: BayesDA; R; impute

