zbMATH — the first resource for mathematics

Permutation methods for factor analysis and PCA. (English) Zbl 07285315
From the author’s abstract: “Researchers often have datasets measuring features $$x_{ij}$$ of samples, such as test scores of students. In factor analysis and PCA, these features are thought to be influenced by unobserved factors, such as skills. Can we determine how many components affect the data? This is an important problem, because decisions made here have a large impact on all downstream data analysis. Consequently, many approaches have been developed. Parallel Analysis is a popular permutation method: it randomly scrambles each feature of the data. It selects components if their singular values are larger than those of the permuted data. Despite widespread use, as well as empirical evidence for its accuracy, it currently has no theoretical justification.”
In this paper, the problem is analyzed under a signal-plus-noise model. Sufficient conditions on the signal components and on the noise component are established that ensure the consistency of parallel analysis. A simulation study supports the theoretical results and highlights notable features of parallel analysis; in particular, the effects of signal strength, of delocalization, and of dimension are studied. Finally, it is shown that strong signals may lead to errors in the detection of weaker signal components.
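The retention rule described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration of Horn-style parallel analysis, not the authors' code: each feature (column) is permuted independently, which destroys cross-feature correlations while preserving marginals, and a component is retained if its singular value exceeds the corresponding mean singular value of the permuted data. The function name and the use of the mean (rather than, e.g., a Glorfeld-style upper quantile) are illustrative choices.

```python
import numpy as np

def parallel_analysis(X, n_perm=20, rng=None):
    """Sketch of permutation-based component selection (parallel analysis).

    Retains the leading components of X whose singular values exceed the
    mean singular values of column-wise permuted copies of X.
    """
    rng = np.random.default_rng(rng)
    X = X - X.mean(axis=0)  # center each feature
    sv = np.linalg.svd(X, compute_uv=False)

    # Permute every column independently: marginal distributions are kept,
    # correlations between features (the "signal") are destroyed.
    perm_sv = np.empty((n_perm, sv.size))
    for b in range(n_perm):
        Xp = np.column_stack([rng.permutation(col) for col in X.T])
        perm_sv[b] = np.linalg.svd(Xp, compute_uv=False)

    # Count leading components whose singular values beat the permutation
    # benchmark; stop at the first failure.
    keep = sv > perm_sv.mean(axis=0)
    k = 0
    while k < keep.size and keep[k]:
        k += 1
    return k
```

On a rank-one spiked matrix with a strong signal, the rule reliably flags at least one component; for weak spikes near the noise level, the paper's analysis explains when and why such a rule can fail.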
MSC:
62H25 Factor analysis and principal components; correspondence analysis
62H12 Estimation in multivariate analysis
60G35 Signal detection and filtering (aspects of stochastic processes)
94A12 Signal theory (characterization, reconstruction, filtering, etc.)
Software:
ElemStatLearn; nFactors; OptShrink
References:
[1] Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York; CRC Press, London. · Zbl 0083.14601
[2] Bai, Z. and Ding, X. (2012). Estimation of spiked eigenvalues in spiked models. Random Matrices Theory Appl. 1 1150011, 21 pp. · Zbl 1251.15037 · doi:10.1142/S2010326311500110
[3] Bai, J. and Ng, S. (2008). Large Dimensional Factor Analysis. Now Publishers, Hanover.
[4] Bai, Z. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices, 2nd ed. Springer, New York. · Zbl 1301.60002
[5] Baik, J., Ben Arous, G. and Péché, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab. 33 1643-1697. · Zbl 1086.15022 · doi:10.1214/009117905000000233
[6] Bartlett, M. S. (1950). Tests of significance in factor analysis. Br. J. Math. Stat. Psychol. 3 77-85.
[7] Benaych-Georges, F. and Nadakuditi, R. R. (2012). The singular values and vectors of low rank perturbations of large rectangular random matrices. J. Multivariate Anal. 111 120-135. · Zbl 1252.15039 · doi:10.1016/j.jmva.2012.04.019
[8] Brown, T. A. (2014). Confirmatory Factor Analysis for Applied Research. Guilford, New York.
[9] Buja, A. and Eyuboglu, N. (1992). Remarks on parallel analysis. Multivar. Behav. Res. 27 509-540.
[10] Cattell, R. B. (1966). The scree test for the number of factors. Multivar. Behav. Res. 1 245-276.
[11] Churchill, G. A. Jr. (1979). A paradigm for developing better measures of marketing constructs. J. Mark. Res. 64-73.
[12] Costello, A. B. and Osborne, J. W. (2005). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Pract. Assess. Res. Eval. 10 1-9.
[13] Dobriban, E., Leeb, W. and Singer, A. (2017). Optimal prediction in the linearly transformed spiked model. Preprint, arXiv:1709.03393; to appear in Ann. Statist. · Zbl 1441.62158 · doi:10.1214/19-AOS1819
[14] Dobriban, E. and Owen, A. B. (2019). Deterministic parallel analysis: An improved method for selecting factors and principal components. J. R. Stat. Soc. Ser. B. Stat. Methodol. 81 163-183. · Zbl 1407.62216 · doi:10.1111/rssb.12301
[15] Dobriban, E. and Wager, S. (2018). High-dimensional asymptotics of prediction: Ridge regression and classification. Ann. Statist. 46 247-279. · Zbl 1428.62307 · doi:10.1214/17-AOS1549
[16] Fabrigar, L. R., Wegener, D. T., MacCallum, R. C. and Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychol. Methods 4 272.
[17] Gaskin, C. J. and Happell, B. (2014). On exploratory factor analysis: A review of recent evidence, an assessment of current practice, and recommendations for future use. Int. J. Nurs. Stud. 51 511-521.
[18] Gerard, D. and Stephens, M. (2017). Unifying and generalizing methods for removing unwanted variation based on negative controls. Preprint, arXiv:1705.08393.
[19] Glorfeld, L. W. (1995). An improvement on Horn’s parallel analysis methodology for selecting the correct number of factors to retain. Educ. Psychol. Meas. 55 377-393.
[20] Green, S. B., Levy, R., Thompson, M. S., Lu, M. and Lo, W.-J. (2012). A proposed solution to the problem with using completely random data to assess the number of factors with parallel analysis. Educ. Psychol. Meas. 72 357-374.
[21] Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. · Zbl 1273.62005
[22] Hayton, J. C., Allen, D. G. and Scarpello, V. (2004). Factor retention decisions in exploratory factor analysis: A tutorial on parallel analysis. Organ. Res. Methods 7 191-205.
[23] Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika 30 179-185. · Zbl 1367.62186 · doi:10.1007/BF02289447
[24] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295-327. · Zbl 1016.62078 · doi:10.1214/aos/1009210544
[25] Jolliffe, I. T. (2002). Principal Component Analysis, 2nd ed. Springer, New York. · Zbl 1011.62064
[26] Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educ. Psychol. Meas. 20 141-151.
[27] Kritchman, S. and Nadler, B. (2008). Determining the number of components in a factor model from limited noisy data. Chemom. Intell. Lab. Syst. 94 19-32.
[28] Leek, J. T. and Storey, J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3 e161.
[29] Leek, J. T. and Storey, J. D. (2008). A general framework for multiple testing dependence. Proc. Natl. Acad. Sci. USA 105 18718-18723. · Zbl 1359.62202 · doi:10.1073/pnas.0808709105
[30] Lin, Z., Yang, C., Zhu, Y. et al. (2016). Simultaneous dimension reduction and adjustment for confounding variation. Proc. Natl. Acad. Sci. USA 113 14662-14667. · Zbl 1407.62218 · doi:10.1073/pnas.1617317113
[31] Nadakuditi, R. R. (2014). OptShrink: An algorithm for improved low-rank signal matrix denoising by optimal, data-driven singular value shrinkage. IEEE Trans. Inform. Theory 60 3002-3018. · Zbl 1360.62399 · doi:10.1109/TIT.2014.2311661
[32] Nadler, B. (2008). Finite sample approximation results for principal component analysis: A matrix perturbation approach. Ann. Statist. 36 2791-2817. · Zbl 1168.62058 · doi:10.1214/08-AOS618
[33] Onatski, A. (2009). Testing hypotheses about the numbers of factors in large factor models. Econometrica 77 1447-1479. · Zbl 1182.62180 · doi:10.3982/ECTA6964
[34] Onatski, A. (2012). Asymptotics of the principal components estimator of large factor models with weakly influential factors. J. Econometrics 168 244-258. · Zbl 1443.62497 · doi:10.1016/j.jeconom.2012.01.034
[35] Onatski, A., Moreira, M. J. and Hallin, M. (2013). Asymptotic power of sphericity tests for high-dimensional data. Ann. Statist. 41 1204-1231. · Zbl 1293.62125 · doi:10.1214/13-AOS1100
[36] Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statist. Sinica 17 1617-1642. · Zbl 1134.62029
[37] Paul, D. and Aue, A. (2014). Random matrix theory in statistics: A review. J. Statist. Plann. Inference 150 1-29. · Zbl 1287.62011 · doi:10.1016/j.jspi.2013.09.005
[38] Peres-Neto, P. R., Jackson, D. A. and Somers, K. M. (2005). How many principal components? Stopping rules for determining the number of non-trivial axes revisited. Comput. Statist. Data Anal. 49 974-997. · Zbl 1429.62223 · doi:10.1016/j.csda.2004.06.015
[39] Quadeer, A. A., Louie, R. H., Shekhar, K., Chakraborty, A. K., Hsing, I.-M. and McKay, M. R. (2014). Statistical linkage analysis of substitutions in patient-derived sequences of genotype 1a hepatitis C virus nonstructural protein 3 exposes targets for immunogen design. J. Virol. 88 7628-7644.
[40] Raiche, G., Magis, D. and Raiche, M. G. (2010). Package ‘nFactors’.
[41] Saccenti, E. and Timmerman, M. E. (2017). Considering Horn’s parallel analysis from a random matrix theory point of view. Psychometrika 82 186-209. · Zbl 1360.62531 · doi:10.1007/s11336-016-9515-z
[42] Spearman, C. (1904). “General intelligence”, objectively determined and measured. Am. J. Psychol. 15 201-292.
[43] Stewart, D. W. (1981). The application and misapplication of factor analysis in marketing research. J. Mark. Res. 51-62.
[44] Thurstone, L. L. (1947). Multiple-Factor Analysis. Univ. Chicago Press, Chicago. · Zbl 0029.22203
[45] Velicer, W. F. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika 41 321-327. · Zbl 0336.62041 · doi:10.1007/BF02293557
[46] Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing 210-268. Cambridge Univ. Press, Cambridge.
[47] Yao, J., Zheng, S. and Bai, Z. (2015). Large Sample Covariance Matrices and High-Dimensional Data Analysis. Cambridge Series in Statistical and Probabilistic Mathematics 39. Cambridge Univ. Press, New York.
[48] Zhou, Y.-H., Marron, J. S. and Wright, F. A. (2018). Eigenvalue significance testing for genetic association. Biometrics 74 439-447. · Zbl 1415.62151 · doi:10.1111/biom.12767
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.