Eigenvector-based sparse canonical correlation analysis: fast computation for estimation of multiple canonical vectors. (English) Zbl 1476.62117

Summary: Classical canonical correlation analysis (CCA) requires matrices to be low dimensional, i.e. the number of features cannot exceed the sample size. Recent developments in CCA have mainly focused on the high-dimensional setting, where the number of features in both matrices under analysis greatly exceeds the sample size. These approaches impose penalties in optimization problems that must be solved iteratively, and they estimate multiple canonical vectors sequentially. In this work, we provide an explicit link between sparse multiple regression and sparse canonical correlation analysis, and an efficient algorithm that can estimate multiple canonical pairs simultaneously rather than sequentially. Furthermore, the algorithm naturally allows parallel computing. These properties make the algorithm much more efficient. We provide theoretical results on the consistency of canonical pairs. The algorithm and theoretical development are based on solving an eigenvector problem, which significantly differentiates our method from existing methods. Simulation results support the improved performance of the proposed approach. We apply eigenvector-based CCA to the analysis of GTEx thyroid histology images, the analysis of SNPs and RNA-seq gene expression data, and a microbiome study. The real data analysis also shows improved performance compared to traditional sparse CCA.
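
For orientation, the following is a minimal sketch of classical (non-sparse) CCA written as a symmetric eigenvector problem. It is not the sparse algorithm of the paper. It only illustrates how a single eigendecomposition yields several canonical pairs at once in the low-dimensional setting. The function name `classical_cca`, the parameter `k` and the small `ridge` term are assumptions introduced for this illustration.

```python
import numpy as np

def classical_cca(X, Y, k=1, ridge=1e-8):
    """Classical CCA via one symmetric eigendecomposition (illustrative sketch).

    X: (n, p) array, Y: (n, q) array, with n larger than p and q.
    Returns the top-k canonical correlations rho and the canonical
    vectors A (p, k) and B (q, k). Assumes the top-k correlations are
    strictly positive. `ridge` is a hypothetical stabiliser, not part
    of the textbook definition.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + ridge * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + ridge * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    # Whiten X with the Cholesky factor: Sxx = Lx Lx^T.
    Lx = np.linalg.cholesky(Sxx)
    W = np.linalg.solve(Lx, Sxy)              # W = Lx^{-1} Sxy
    # H = Lx^{-1} Sxy Syy^{-1} Syx Lx^{-T}; its eigenvalues are rho^2.
    H = W @ np.linalg.solve(Syy, W.T)
    H = (H + H.T) / 2                          # enforce exact symmetry
    evals, evecs = np.linalg.eigh(H)           # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]          # indices of the k largest

    rho = np.sqrt(np.clip(evals[top], 0.0, 1.0))
    A = np.linalg.solve(Lx.T, evecs[:, top])   # a = Lx^{-T} u, so a^T Sxx a = 1
    B = np.linalg.solve(Syy, Sxy.T @ A) / rho  # b = Syy^{-1} Syx a / rho
    return rho, A, B

# Usage on synthetic data with one shared latent factor.
rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
X = z @ rng.normal(size=(1, 4)) + 0.5 * rng.normal(size=(200, 4))
Y = z @ rng.normal(size=(1, 3)) + 0.5 * rng.normal(size=(200, 3))
rho, A, B = classical_cca(X, Y, k=2)
```

All `k` pairs come out of the same call to `np.linalg.eigh`, in contrast to the sequential estimation the summary attributes to penalised approaches. The sketch requires invertible sample covariance matrices, which is exactly the low-dimensional restriction the summary describes.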


62H20 Measures of association (correlation, canonical correlation, etc.)
62H12 Estimation in multivariate analysis
62J07 Ridge regression; shrinkage estimators (Lasso)
62P10 Applications of statistics to biology and medical sciences; meta analysis
92D20 Protein sequences, DNA sequences


RGCCA; scca; PMA; glmnet; CCA; EBImage; GPLP

