Identifying the informational/signal dimension in principal component analysis. (English) Zbl 1407.62215
Summary: Identifying a reduced-dimensional representation of the data is among the main issues of exploratory multidimensional data analysis, and several solutions have been proposed in the literature, depending on the method. Principal Component Analysis (PCA) is the method that has received the most attention thus far. Several identification methods, the so-called stopping rules, have been proposed for it; they give very different results in practice, and some comparative studies have been carried out. Some inconsistencies in the previous studies led us to try to fix the distinction between signal and noise in PCA, and its limits, and to propose a new testing method. This consists in producing simulated data according to a predefined eigenvalue structure, including zero eigenvalues. From random populations built according to several such structures, reduced-size samples were extracted, and different levels of random normal noise were added to them. This controlled introduction of noise allows a clear distinction between expected signal and noise, the latter being relegated to the non-zero eigenvalues in the samples that correspond to zero eigenvalues in the population. With this new method, we tested the performance of ten different stopping rules. For every method, every structure, and every noise level, both power (the ability to correctly identify the expected dimension) and type-I error (the detection of a dimension composed only of noise) were measured, by counting the relative frequency with which the smallest non-zero eigenvalue in the population was recognized as signal in the samples and that with which the largest zero eigenvalue was recognized as noise, respectively. In this way, the behaviour of the examined methods becomes clear, and their comparison and evaluation are possible.
The reported results show that both Rencher's generalization of Bartlett's test and Pillar's bootstrap method perform much better than all the others: both show reasonable power, decreasing with noise, and very good type-I error. Thus, more than the others, these methods deserve to be adopted.
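The testing scheme described in the summary can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, samples are drawn directly from a multivariate normal distribution instead of being extracted from a finite random population, PCA is run on the covariance matrix, and the "eigenvalue above the mean" rule is a simple stand-in for the ten stopping rules actually tested. Type-I error is counted here following the parenthetical definition in the summary (a noise-only dimension being detected as signal).

```python
import numpy as np


def simulate_sample(pop_eigenvalues, n_obs, noise_sd, rng):
    """Draw n_obs observations whose population covariance has the given
    eigenvalues (zeros allowed), then add independent normal noise."""
    p = len(pop_eigenvalues)
    # Random orthonormal basis (assumption: the rotation is irrelevant here).
    basis, _ = np.linalg.qr(rng.standard_normal((p, p)))
    scores = rng.standard_normal((n_obs, p)) * np.sqrt(pop_eigenvalues)
    signal = scores @ basis.T
    noise = rng.standard_normal((n_obs, p)) * noise_sd
    return signal + noise


def sample_eigenvalues(x):
    """Eigenvalues of the sample covariance matrix, in decreasing order."""
    cov = np.cov(x, rowvar=False)
    return np.sort(np.linalg.eigvalsh(cov))[::-1]


def retained_by_mean_rule(eigs):
    """Stand-in stopping rule: keep components whose eigenvalue exceeds
    the mean eigenvalue."""
    return int(np.sum(eigs > eigs.mean()))


def power_and_type1(pop_eigenvalues, n_obs, noise_sd, n_rep=200, seed=0):
    """Relative frequencies over n_rep simulated samples."""
    rng = np.random.default_rng(seed)
    k = int(np.sum(np.asarray(pop_eigenvalues) > 0))  # expected dimension
    hits = 0          # smallest non-zero population eigenvalue kept as signal
    false_alarms = 0  # largest zero population eigenvalue kept as signal
    for _ in range(n_rep):
        x = simulate_sample(pop_eigenvalues, n_obs, noise_sd, rng)
        kept = retained_by_mean_rule(sample_eigenvalues(x))
        hits += kept >= k
        false_alarms += kept > k
    return hits / n_rep, false_alarms / n_rep


if __name__ == "__main__":
    structure = [3.0, 3.0, 3.0, 0.0, 0.0, 0.0]  # hypothetical structure
    for sd in (0.1, 0.5, 1.0):
        power, type1 = power_and_type1(structure, n_obs=200, noise_sd=sd)
        print(f"noise sd={sd}: power={power:.2f}, type-I error={type1:.2f}")
```

Any other stopping rule could be substituted for `retained_by_mean_rule`, provided it returns the number of retained components, so that several rules can be compared on the same structures and noise levels.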
62H25 Factor analysis and principal components; correspondence analysis
60G40 Stopping times; optimal stopping problems; gambling theory
bootstrap; Canoco; sedaR
[1] Gnanadesikan, R.; Kettenring, J. Robust estimates, residuals, and outlier detection with multiresponse data. Biometrics 1972, 28, 81-124.
[2] Jolliffe, I. Principal Component Analysis. Berlin, Germany, 2002. · Zbl 1011.62064
[3] Rencher, A.C. Methods of Multivariate Analysis. New York, NY, USA, 2002. · Zbl 0995.62056
[4] Lebart, L.; Piron, M.; Morineau, A. Statistique Exploratoire Multidimensionnelle: Visualisation et Inférence en Fouilles de Données. Paris, France, 2016.
[5] Guttman, L. Some necessary conditions for common-factor analysis. Psychometrika 1954, 19, 149-161. · Zbl 0058.13004
[6] Jolliffe, I.T. Discarding Variables in a Principal Component Analysis. I: Artificial Data. Appl. Stat. 1972, 21, 160-173.
[7] Cattell, R.B. The scree test for the number of factors. Multivar. Behav. Res. 1966, 1, 245-276.
[8] Jackson, D.A. Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches. Ecology 1993, 74, 2204-2214.
[9] Peres-Neto, P.R.; Jackson, D.A.; Somers, K.M. How many principal components? Stopping rules for determining the number of non-trivial axes revisited. Comput. Stat. Data Anal. 2005, 49, 974-997. · Zbl 1429.62223
[10] Frontier, S. Étude de la décroissance des valeurs propres dans une analyse en composantes principales: Comparaison avec le modèle du bâton brisé. J. Exp. Mar. Biol. Ecol. 1976, 25, 67-75.
[11] Legendre, P.; Legendre, L. Numerical Ecology. Amsterdam, The Netherlands, 1998.
[12] Caron, P.O. A Monte Carlo examination of the broken-stick distribution to identify components to retain in principal component analysis. J. Stat. Comput. Simul. 2016, 86, 2405-2410.
[13] Bartlett, M.S. A note on the multiplying factors for various χ² approximations. J. R. Stat. Soc. Ser. B Math. 1954, 16, 296-298. · Zbl 0057.35404
[14] Wold, S. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics 1978, 20, 397-405. · Zbl 0403.62032
[15] Eastment, H.; Krzanowski, W. Cross-validatory choice of the number of components from a principal component analysis. Technometrics 1982, 24, 73-77.
[16] Minka, T.P. Automatic choice of dimensionality for PCA. In Proceedings of the 13th International Conference on Neural Information Processing Systems; pp. 598-604.
[17] Auer, P.; Gervini, D. Choosing principal components: A new graphical method based on Bayesian model selection. Commun. Stat. Simul. Comput. 2008, 37, 962-977. · Zbl 1160.62334
[18] Wang, M.; Kornblau, S.M.; Coombes, K.R. Decomposing the Apoptosis Pathway into Biologically Interpretable Principal Components. Cancer Inform. 2017, 17.
[19] Pillar, V.D. The bootstrapped ordination re-examined. J. Veg. Sci. 1999, 10, 895-902.
[20] Vieira, V.M. Permutation tests to estimate significances on Principal Components Analysis. Comput. Ecol. Softw. 2012, 2, 103-123.
[21] Camiz, S.; Pillar, V.D. Comparison of Single and Complete Linkage Clustering with the Hierarchical Factor Classification of Variables. Community Ecol. 2007, 8, 25-30.
[22] Feoli, E.; Zuccarello, V. Fuzzy Sets and Eigenanalysis in Community Studies: Classification and Ordination are “Two Faces of the Same Coin”. Community Ecol. 2013, 14, 164-171.
[23] Jolliffe, I.T. A note on the use of principal components in regression. J. R. Stat. Soc. Ser. C Appl. Stat. 1982, 31, 300-303.
[24] Céréghino, R.; Pillar, V.; Srivastava, D.; de Omena, P.M.; MacDonald, A.A.M.; Barberis, I.M.; Corbara, B.; Guzman, L.M.; Leroy, C.; Bautista, F.O. Constraints on the Functional Trait Space of Aquatic Invertebrates in Bromeliads. Funct. Ecol. 2018, 32, 2435-2447.
[25] Ferré, L. Selection of components in principal component analysis: A comparison of methods. Comput. Stat. Data Anal. 1995, 19, 669-682.
[26] Dray, S. On the number of principal components: A test of dimensionality based on measurements of similarity between matrices. Comput. Stat. Data Anal. 2008, 52, 2228-2237. · Zbl 05564631
[27] Karr, J.; Martin, T. Random number and principal components: Further searches for the unicorn. In The Use of Multivariate Statistics in Wildlife Habitat; Washington, DC, USA, 1981; pp. 20-24.
[28] Gauch, H.G.J. Reduction by Eigenvector Ordinations. Ecology 1982, 63, 1643-1649.
[29] Jackson, D.A.; Somers, K.M.; Harvey, H.H. Null models and fish communities: Evidence of nonrandom patterns. Am. Nat. 1992, 139, 930-951.
[30] Abdi, H. Singular Value Decomposition (SVD) and Generalized Singular Value Decomposition (GSVD). In Encyclopedia of Measurement and Statistics; Thousand Oaks, CA, USA, 2007.
[31] Eckart, C.; Young, G. The approximation of one matrix by another of lower rank. Psychometrika 1936, 1, 211-218. · JFM 62.1075.02
[32] Basilevsky, A. Statistical Factor Analysis and Related Methods: Theory and Applications. New York, NY, USA, 1994. · Zbl 1130.62341
[33] Malinvaud, E. Data analysis in applied socio-economic statistics with special consideration of correspondence analysis. In Proceedings of the Academy of Marketing Science (AMS) Annual Conference.
[34] Ben Ammou, S.; Saporta, G. On the connection between the distribution of eigenvalues in multiple correspondence analysis and log-linear models. Revstat Stat. J. 2003, 1, 42-79. · Zbl 1057.62043
[35] Wishart, J. The Generalised Product Moment Distribution in Samples from a Normal Multivariate Population. Biometrika 1928, 20, 32-52. · JFM 54.0565.02
[36] Anderson, T. Asymptotic Theory for Principal Component Analysis. Ann. Math. Stat. 1963, 34, 122-148. · Zbl 0202.49504
[37] Jackson, J.E. A User’s Guide to Principal Components. New York, NY, USA, 1991. · Zbl 0743.62047
[38] Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Stat. 1979, 7, 1-26. · Zbl 0406.62024
[39] Manly, B.F. Randomization, Bootstrap and Monte Carlo Methods in Biology. Boca Raton, FL, USA, 2007. · Zbl 1269.62076
[40] Efron, B.; Tibshirani, R. An Introduction to the Bootstrap. New York, NY, USA, 1993. · Zbl 0835.62038
[41] Barton, D.; David, F. Some notes on ordered random intervals. J. R. Stat. Soc. Ser. B Methodol. 1956, 18, 79-94. · Zbl 0071.34802
[42] Cangelosi, R.; Goriely, A. Component retention in principal component analysis with application to cDNA microarray data. Biol. Direct 2007, 2, 1-21.
[43] Jost, L. Entropy and diversity. Oikos 2006, 113, 363-375.
[44] Ter Braak, C.J. CANOCO - A FORTRAN Program for Canonical Community Ordination by [Partial][Detrended][Canonical] Correspondence Analysis, Principal Components Analysis and Redundancy Analysis (Version 2.1). Wageningen, The Netherlands, 1988.
[45] Ter Braak, C.J. CANOCO Version 3.1, Update Notes. Wageningen, The Netherlands, 1990.
[46] Escoufier, Y. Le Traitement des Variables Vectorielles. Biometrics 1973, 29, 751-760.
[47] Robert, P.; Escoufier, Y. A Unifying Tool for Linear Multivariate Statistical Methods: The RV-Coefficient. Appl. Stat. 1976, 25, 257-265.
[48] Josse, J.; Pagès, J.; Husson, F. Testing the significance of the RV coefficient. Comput. Stat. Data Anal. 2008, 53, 82-91. · Zbl 05565114
[49] Schönemann, P.H.; Carroll, R.M. Fitting one matrix to another under choice of a central dilation and a rigid motion. Psychometrika 1970, 35, 245-255.
[50] Pillar, V.D. Sampling sufficiency in ecological surveys. Abstr. Bot. 1998, 22, 37-48.
[51] Stapleton, J. Linear Statistical Models. New York, NY, USA, 1995. · Zbl 0854.62059
[52] Camacho, J.; Ferrer, A. Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: Theoretical aspects. J. Chemom. 2012, 26, 361-373.
[53] Camacho, J.; Ferrer, A. Cross-validation in PCA models with the element-wise k-fold (ekf) algorithm: Practical aspects. Chemom. Intell. Lab. Syst. 2014, 131, 37-50.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.