Principal component analysis for data containing outliers and missing elements. (English) Zbl 1452.62419

Summary: Two approaches are presented for performing principal component analysis (PCA) on data that contain both outlying cases and missing elements. First, an eigendecomposition of a covariance matrix that can deal with such data is proposed, but this approach is not suited to data in which the number of variables exceeds the number of cases. Alternatively, an expectation robust (ER) algorithm is proposed that adapts the existing methodology for robust PCA to data containing missing elements. According to an extensive simulation study, the ER approach performs well for all data sizes considered. Simulations and an example show that, by virtue of the ER algorithm, the properties of the existing methods for robust PCA carry over to data with missing elements.
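
The summary gives no algorithmic detail, so the following is only a minimal sketch of the general idea behind an ER-type scheme: alternate a robust PCA fit on completed data with re-imputation of the missing cells from that fit. It is not the authors' algorithm. Several choices here are assumptions made purely for illustration: spherical PCA (median centring plus spatial signs) stands in for the robust PCA method, only one component is fitted, missing cells start at the column median, and the function names spherical_pc1 and er_pca_rank1 are hypothetical.

```python
import math
import statistics


def spherical_pc1(rows, n_power=100):
    """Median centre and first loading vector from spherical PCA.

    Each centred row is scaled to unit length, so a far-away case cannot
    dominate the scatter; the leading eigenvector of that scatter is then
    found by power iteration. (Stand-in robust PCA, an assumption.)
    """
    p = len(rows[0])
    center = [statistics.median(r[j] for r in rows) for j in range(p)]
    signs = []
    for r in rows:
        d = [r[j] - center[j] for j in range(p)]
        norm = math.sqrt(sum(x * x for x in d))
        if norm > 0.0:  # a row exactly on the centre has no direction
            signs.append([x / norm for x in d])
    v = signs[0][:]  # start the power iteration from an observed direction
    for _ in range(n_power):
        w = [0.0] * p
        for s in signs:
            proj = sum(a * b for a, b in zip(s, v))
            for j in range(p):
                w[j] += proj * s[j]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return center, v


def er_pca_rank1(data, n_iter=500, tol=1e-10):
    """Alternate a robust rank-1 fit with re-imputation of missing cells.

    `data` is a list of rows; None marks a missing element.
    """
    p = len(data[0])
    # Initial guess (an assumption): column median of the observed values.
    col_median = [
        statistics.median(r[j] for r in data if r[j] is not None)
        for j in range(p)
    ]
    filled = [
        [col_median[j] if r[j] is None else float(r[j]) for j in range(p)]
        for r in data
    ]
    for _ in range(n_iter):
        center, v = spherical_pc1(filled)  # robust fit on completed data
        change = 0.0
        for i, r in enumerate(data):
            if None not in r:
                continue
            score = sum((filled[i][j] - center[j]) * v[j] for j in range(p))
            for j in range(p):
                if r[j] is None:  # only missing cells are overwritten
                    new = center[j] + score * v[j]
                    change = max(change, abs(new - filled[i][j]))
                    filled[i][j] = new
        if change < tol:
            break
    return center, v, filled


# Toy data: clean cases on the line y = 2x, two outlying cases,
# and two missing cells (None).
data = [[float(x), 2.0 * x] for x in range(-5, 6)]
data[1][1] = None  # case with x = -4
data[9][1] = None  # case with x = 4
data += [[0.0, 30.0], [-30.0, 0.0]]

center, loading, completed = er_pca_rank1(data)
```

In this constructed example the clean cases lie on a line through the origin, so the fitted loading should align (up to sign) with the direction (1, 2) and the two imputed cells should settle close to that line, while observed cells are left untouched. The alternation converges only linearly, which is typical of iterative PCA imputation. The robust PCA methods the summary refers to, such as ROBPCA, would need their own implementations in place of the spherical PCA stand-in.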

MSC:

62H25 Factor analysis and principal components; correspondence analysis
62F35 Robustness and adaptive procedures (parametric inference)
62D10 Missing data
62-08 Computational methods for problems pertaining to statistics

Software:

ROBPCA; LIBRA; TOMCAT

References:

[1] Cheng, T.-S.; Victoria-Feser, M.-P., High-breakdown estimation of multivariate mean and covariance with missing observations, British J. Math. Statist. Psych., 55, 317-335 (2002)
[2] Copt, S.; Victoria-Feser, M.-P., Fast algorithms for computing high breakdown covariance matrices with missing data. Cahiers du département d’économétrie août 2003. Faculté des sciences économiques et sociales (2003), Université de Genève: Université de Genève Geneva, Switzerland
[3] Croux, C., Efficient high-breakdown M-estimators of scale, Statist. Prob. Lett., 19, 371-379 (1994) · Zbl 0791.62034
[4] Croux, C.; Haesbroeck, G., Influence function and efficiency of the minimum covariance determinant scatter matrix estimator, J. Multivariate Anal., 71, 161-190 (1999) · Zbl 0946.62055
[5] Croux, C.; Ruiz-Gazen, A., A fast algorithm for robust principal components based on projection pursuit, in: Prat, A. (Ed.), COMPSTAT: Proceedings in Computational Statistics, Physica, Heidelberg, 211-216 (1996) · Zbl 0900.62300
[6] Croux, C.; Ruiz-Gazen, A., High breakdown estimators for principal components: the projection-pursuit approach revisited, J. Multivariate Anal., 95, 206-226 (2005) · Zbl 1065.62040
[7] Cui, H.; He, X.; Ng, K. W., Asymptotic distributions of principal components based on robust dispersions, Biometrika, 90, 953-966 (2003) · Zbl 1436.62222
[8] Daszykowski, M.; Serneels, S.; Kaczmarek, K.; Van Espen, P. J.; Croux, C.; Walczak, B., TOMCAT: a MATLAB toolbox for multivariate calibration techniques, Chemometr. Intell. Lab. Syst., 85, 269-277 (2007)
[9] Davies, L. P.; Gather, U., Breakdown and groups, Ann. Statist., 33, 977-1035 (2005) · Zbl 1077.62041
[10] Debruyne, M.; Hubert, M., The influence function of Stahel-Donoho type methods for robust covariance estimation and PCA, Scand. J. Statist., submitted for publication (2007)
[11] Dempster, A. P.; Laird, N. M.; Rubin, D. B., Maximum likelihood for incomplete data via the EM algorithm (with discussions), J. Roy. Statist. Soc. Ser. B, 39, 1-38 (1977) · Zbl 0364.62022
[12] Engelen, S.; Hubert, M.; Vanden Branden, K., A comparison of three procedures for robust PCA in high dimensions, Austr. J. Statist., 34, 117-126 (2005)
[13] Garamszegi, L. G.; Heylen, D.; Møller, A. P.; Eens, M.; de Lope, F., Age-dependent health status and song characteristics in the barn swallow, Behavioral Ecology, 16, 580-591 (2005)
[14] Grize, Y. L., Robustheitseigenschaften von Korrelationsschätzungen. Diplomarbeit, Eidgenössische Technische Hochschule (ETH) (1978), Zürich: Zürich Switzerland
[15] Hampel, F. R.; Ronchetti, E. M.; Rousseeuw, P. J.; Stahel, W. A., Robust Statistics: The Approach Based on Influence Functions (1986), Wiley: Wiley New York · Zbl 0593.62027
[16] Huber, P., Projection pursuit, Ann. Statist., 13, 435-475 (1985) · Zbl 0595.62059
[17] Hubert, M.; Rousseeuw, P. J.; Verboven, S., A fast method for robust principal components with applications to chemometrics, Chemometr. Intell. Lab. Syst., 60, 101-111 (2002)
[18] Hubert, M.; Rousseeuw, P. J.; Vanden Branden, K., ROBPCA: a new approach to robust principal components analysis, Technometrics, 47, 64-79 (2005)
[19] Krzanowski, W. J., Between-groups comparison of principal components, J. Amer. Statist. Assoc., 74, 703-707 (1979) · Zbl 0459.62042
[20] Lax, D. A., Robust estimators of scale: finite-sample performance in long-tailed symmetric distributions, J. Amer. Statist. Assoc., 80, 736-741 (1985)
[21] Li, G.; Chen, Z., Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and Monte Carlo, J. Amer. Statist. Assoc., 80, 759-766 (1985) · Zbl 0595.62060
[22] Little, R. J.A., Robust estimation of the mean and covariance matrix from data with missing values, Appl. Statist., 37, 23-38 (1988) · Zbl 0647.62040
[23] Locantore, N.; Marron, J. S.; Simpson, D. G.; Tripoli, N.; Zhang, J. T.; Cohen, K. L., Principal component analysis for functional data, Test, 8, 1-73 (1998) · Zbl 0980.62049
[24] Maronna, R., Principal components and orthogonal regression based on robust scales, Technometrics, 47, 264-273 (2005)
[25] Pearson, K., On lines and planes of closest fit to systems of points in space, Philos. Mag., 2, 559-572 (1901) · JFM 32.0246.07
[26] Rousseeuw, P. J., Multivariate estimation with high breakdown point, in: Grossmann, W., Pflug, G., Vincze, I., Wertz, W. (Eds.), Mathematical Statistics and Applications, vol. B, Reidel, Dordrecht, 283-297 (1985)
[27] Rousseeuw, P. J., Maxbias Curve, in: Kotz, S., Read, C., Banks, D. (Eds.), Encyclopedia of Statistical Sciences, Update vol. 3, Wiley, New York, 441-443 (1999)
[28] Rousseeuw, P. J.; Croux, C., Alternatives to the median absolute deviation, J. Amer. Statist. Assoc., 88, 1273-1283 (1994) · Zbl 0792.62025
[29] Rousseeuw, P. J.; Leroy, A. M., Robust Regression and Outlier Detection (1987), Wiley: Wiley New York · Zbl 0711.62030
[30] Rousseeuw, P. J.; Yohai, V. J., Robust regression by means of S-estimators, in: Franke, J.W., Hardle, P.J., Martin, R.D. (Eds.), Robust and Nonlinear Time Series Analysis, Springer, New York, 256-272 (1984) · Zbl 0567.62027
[31] Rubin, D. B., Inference and missing data, Biometrika, 63, 581-592 (1976) · Zbl 0344.62034
[32] Serneels, S.; De Nolf, E.; Van Espen, P. J., Spatial sign pre-processing: a simple way to impart moderate robustness to multivariate estimators, J. Chem. Info. Model., 46, 1402-1409 (2006)
[33] Smilde, A. K.; Geladi, P.; Bro, R., Multi-way Analysis with Applications in the Chemical Sciences (2004), Wiley: Wiley Chichester, UK
[34] Stanimirova, I.; Walczak, B.; Massart, D. L.; Simeonov, V., A comparison between two robust PCA algorithms, Chemometr. Intell. Lab. Syst., 71, 83-95 (2004)
[35] Verboven, S.; Hubert, M., LIBRA: a MATLAB library for robust analysis, Chemometr. Intell. Lab. Syst., 75, 127-136 (2005)
[36] Walczak, B.; Massart, D. L., Dealing with missing data, Part I. Chemometr. Intell. Lab. Syst., 58, 15-27 (2001)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.