zbMATH — the first resource for mathematics

PCA consistency in high dimension, low sample size context. (English) Zbl 1191.62108
Summary: Principal Component Analysis (PCA) is an important tool for dimension reduction, especially when the dimension (or the number of variables) is very high. Asymptotic studies in which the sample size is fixed and the dimension grows [i.e., High Dimension, Low Sample Size (HDLSS) asymptotics] are becoming increasingly relevant. We investigate the asymptotic behavior of Principal Component (PC) directions. HDLSS asymptotics are used to study consistency, strong inconsistency and subspace consistency. We show that if the first few eigenvalues of a population covariance matrix are large enough compared to the others, then the corresponding estimated PC directions are consistent or converge to the appropriate subspace (subspace consistency), and most other PC directions are strongly inconsistent. Broad sets of sufficient conditions for each of these cases are specified, and the main theorem gives a catalogue of possible combinations. In preparation for these results, we show that the geometric representation of HDLSS data holds under general conditions, which include a \(\rho\)-mixing condition and a broad range of sphericity measures of the covariance matrix.
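
The contrast between consistent and strongly inconsistent PC directions described in the summary can be explored numerically. The following is a minimal simulation sketch, not the paper's method. It assumes mean-zero Gaussian data, a single-spike population covariance diag(d^alpha, 1, ..., 1), and a fixed sample size n; the function name first_pc_angle_deg and the exponent alpha are hypothetical choices made only for illustration. It reports the angle between the first sample PC direction and the first population PC direction as the dimension d grows.

```python
import numpy as np

def first_pc_angle_deg(d, n, alpha, rng):
    """Angle in degrees between the first sample PC direction and e_1.

    Illustrative assumptions (not taken from the paper): mean-zero Gaussian
    data with population covariance diag(d**alpha, 1, ..., 1), so the first
    population PC direction is the coordinate vector e_1.
    """
    scales = np.ones(d)
    scales[0] = d ** (alpha / 2.0)            # std. dev. of the spiked coordinate
    X = rng.standard_normal((n, d)) * scales  # n observations (rows), d variables
    # No centering: the mean is assumed known to be zero.
    # The leading right singular vector of X is the first sample PC direction.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    cos = min(1.0, abs(vt[0, 0]))             # |<u_1, e_1>|, guarded against rounding
    return float(np.degrees(np.arccos(cos)))

rng = np.random.default_rng(0)
n = 20                                        # sample size stays fixed
for alpha in (2.0, 0.2):                      # strong spike vs. weak spike
    for d in (50, 500, 5000):                 # dimension grows
        angle = first_pc_angle_deg(d, n, alpha, rng)
        print(f"alpha={alpha}, d={d}: angle = {angle:.1f} degrees")
```

Under these assumptions, one would expect the angle to shrink toward 0 degrees for the strong spike and to drift toward 90 degrees for the weak spike as d increases, which mirrors the consistency and strong inconsistency regimes in the summary. The exact conditions separating the regimes are those stated in the paper, not in this sketch.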

62H25 Factor analysis and principal components; correspondence analysis
34L20 Asymptotic distribution of eigenvalues, asymptotic theory of eigenfunctions for ordinary differential operators
62F12 Asymptotic properties of parametric estimators
15A18 Eigenvalues, singular values, and eigenvectors
[1] Ahn, J., Marron, J. S., Muller, K. M. and Chi, Y.-Y. (2007). The high-dimension, low-sample-size geometric representation holds under mild conditions. Biometrika 94 760-766. · Zbl 1135.62039
[2] Baik, J., Ben Arous, G. and Péché, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab. 33 1643-1697. · Zbl 1086.15022
[3] Baik, J. and Silverstein, J. W. (2006). Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal. 97 1382-1408. · Zbl 1220.15011
[4] Bhattacharjee, A., Richards, W., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., Loda, M., Weber, G., Mark, E. J., Lander, E. S., Wong, W., Johnson, B. E., Golub, T. R., Sugarbaker, D. J. and Meyerson, M. (2001). Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA 98 13790-13795.
[5] Bradley, R. C. (2005). Basic properties of strong mixing conditions. A survey and some open questions. Probab. Surv. 2 107-144 (electronic). (Update of, and a supplement to, the 1986 original.) · Zbl 1189.60077
[6] Eaton, M. L. and Tyler, D. E. (1991). On Wielandt’s inequality and its application to the asymptotic distribution of the eigenvalues of a random symmetric matrix. Ann. Statist. 19 260-271. · Zbl 0742.62015
[7] Gaydos, T. L. (2008). Data representation and basis selection to understand variation of function valued traits. Ph.D. thesis, Univ. North Carolina at Chapel Hill.
[8] Hall, P., Marron, J. S. and Neeman, A. (2005). Geometric representation of high dimension, low sample size data. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 427-444. · Zbl 1069.62097
[9] John, S. (1971). Some optimal multivariate tests. Biometrika 58 123-127. · Zbl 0218.62055
[10] John, S. (1972). The distribution of a statistic used for testing sphericity of normal distributions. Biometrika 59 169-173. · Zbl 0231.62072
[11] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295-327. · Zbl 1016.62078
[12] Johnstone, I. M. and Lu, A. Y. (2004). Sparse principal component analysis. Unpublished manuscript.
[13] Kato, T. (1995). Perturbation Theory for Linear Operators. Springer, Berlin. (Reprint of the 1980 edition.) · Zbl 0836.47009
[14] Kolmogorov, A. N. and Rozanov, Y. A. (1960). On strong mixing conditions for stationary Gaussian processes. Theory Probab. Appl. 5 204-208. · Zbl 0106.12005
[15] Liu, Y., Hayes, D. N., Nobel, A. and Marron, J. S. (2008). Statistical significance of clustering for high dimension low sample size data. J. Amer. Statist. Assoc. 103 1281-1293. · Zbl 1205.62079
[16] Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist. Sinica 17 1617-1642. · Zbl 1134.62029
[17] Rao, C. R. (1973). Linear Statistical Inference and Its Applications, 2nd ed. Wiley, New York. · Zbl 0256.62002