Comparisons among several methods for handling missing data in principal component analysis (PCA). (English) Zbl 07073915

Summary: Missing data are prevalent in many data analytic situations, and those in which principal component analysis (PCA) is applied are no exception. The performance of five methods for handling missing data in PCA is investigated: the missing data passive method, the weighted low rank approximation (WLRA) method, the regularized PCA (RPCA) method, the trimmed scores regression method, and the data augmentation (DA) method. Three complete data sets of varying sizes were selected, in which missing data were created randomly and non-randomly. These data were then analyzed by the five methods, and their parameter recovery capability, measured by the mean congruence coefficient between loadings obtained from full and missing data, was compared as a function of the number of extracted components (dimensionality) and the proportion of missing data (censor rate). For randomly censored data, all five methods worked well when the dimensionality and censor rate were small. Their performance deteriorated as the dimensionality and censor rate increased, and the deterioration was distinctly faster with the WLRA method. The RPCA method worked best, and the DA method came a close second in terms of parameter recovery. However, the latter, as implemented here, was found to be extremely time-consuming. For non-randomly censored data, the recovery was also affected by the degree of non-randomness in the censoring process. Again the RPCA method worked best, maintaining good to excellent recovery when the censor rate was small and the dimensionality of the solution was not too high.
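The evaluation design described in the summary (censor a complete data set, refit, and compare loadings by a mean congruence coefficient) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names are hypothetical, the congruence coefficient is taken to be Tucker's coefficient, components are assumed to be already matched column by column (absolute values are used to ignore sign flips), and column-mean imputation stands in as a simple baseline that is not one of the five methods compared in the paper.

```python
import numpy as np


def congruence(x, y):
    """Tucker's congruence coefficient between two loading vectors."""
    return float(x @ y / np.sqrt((x @ x) * (y @ y)))


def mean_congruence(A, B):
    """Mean absolute congruence over corresponding columns of A and B.

    Assumption: column k of A corresponds to column k of B; the absolute
    value only removes the sign indeterminacy of each component.
    """
    return float(np.mean([abs(congruence(A[:, k], B[:, k]))
                          for k in range(A.shape[1])]))


def pca_loadings(X, r):
    """Loadings of the first r principal components of column-centered X."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:r].T * s[:r] / np.sqrt(X.shape[0])


def censor_at_random(X, rate, rng):
    """Copy of X with roughly a proportion `rate` of entries set to NaN."""
    Xm = X.astype(float).copy()
    Xm[rng.random(X.shape) < rate] = np.nan
    return Xm


def mean_impute(Xm):
    """Baseline only (not a method from the paper): fill NaNs with column means."""
    col_means = np.nanmean(Xm, axis=0)
    return np.where(np.isnan(Xm), col_means, Xm)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic rank-2 data plus noise, standing in for a complete data set.
    X = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 8))
    X += 0.1 * rng.normal(size=X.shape)
    full = pca_loadings(X, 2)
    for rate in (0.05, 0.2, 0.4):
        missing = pca_loadings(mean_impute(censor_at_random(X, rate, rng)), 2)
        print(rate, mean_congruence(full, missing))
```

Looping over several values of `rate` and of the number of components `r` gives the kind of recovery curves the summary refers to. Replacing `mean_impute` with another missing-data method would allow a comparison in the same framework.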


62H25 Factor analysis and principal components; correspondence analysis
62R07 Statistical aspects of big data and data science



