Selecting the number of components in principal component analysis using cross-validation approximations. (English) Zbl 1243.62082

Summary: Cross-validation is a tried and tested approach to selecting the number of components in principal component analysis (PCA); its main drawback is its computational cost. In a regression (or nonparametric regression) setting, criteria such as generalized cross-validation (GCV) provide convenient approximations to leave-one-out cross-validation. They are based on the relation between the prediction error and the residual sum of squares weighted by elements of a projection matrix (or a smoothing matrix). Such a relation is then established in PCA using an original presentation of PCA with a unique projection matrix. It enables the definition of two cross-validation approximation criteria: the smoothing approximation of the cross-validation criterion (SACV) and the GCV criterion. The method is assessed with simulations and gives promising results.
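
The regression relation the summary refers to can be illustrated with a minimal sketch. It covers only the classical regression case (simple linear regression with an intercept), not the PCA criteria defined in the paper. In ordinary least squares the leave-one-out prediction error for observation i equals the ordinary residual divided by 1 - h_ii, where h_ii is the i-th diagonal element of the projection matrix, and GCV replaces each h_ii by the average tr(H)/n. The function names and the toy data below are assumptions made up for this illustration.

```python
from statistics import fmean


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    xbar, ybar = fmean(x), fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b * xbar, b


def leverages(x):
    """Diagonal of the projection (hat) matrix for the model y = a + b*x."""
    n, xbar = len(x), fmean(x)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]


def loo_brute_force(x, y):
    """Mean squared leave-one-out prediction error, refitting n times."""
    errors = []
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        errors.append((y[i] - (a + b * x[i])) ** 2)
    return fmean(errors)


def loo_shortcut(x, y):
    """Same quantity from a single fit: residuals rescaled by 1 - h_ii."""
    a, b = fit_line(x, y)
    h = leverages(x)
    return fmean(((yi - (a + b * xi)) / (1.0 - hi)) ** 2
                 for xi, yi, hi in zip(x, y, h))


def gcv(x, y):
    """GCV: each h_ii is replaced by the average leverage tr(H)/n."""
    a, b = fit_line(x, y)
    h = leverages(x)
    n = len(x)
    mse = fmean((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return mse / (1.0 - sum(h) / n) ** 2


# Toy data, invented for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.1, 0.9, 2.3, 2.8, 4.2, 4.9]
print(loo_brute_force(x, y), loo_shortcut(x, y), gcv(x, y))
```

The first two printed values agree up to rounding, which is the exact identity that makes the shortcut possible in linear regression; the third is the GCV approximation, which needs only the trace of the projection matrix. The cost saving is the point: one fit instead of n refits.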


62H25 Factor analysis and principal components; correspondence analysis
65C60 Computational problems in statistics (MSC2010)

