CDPA: common and distinctive pattern analysis between high-dimensional datasets. (English) Zbl 07524978

Summary: A representative model in the integrative analysis of two high-dimensional correlated datasets decomposes each data matrix into a low-rank common matrix generated by latent factors shared across the datasets, a low-rank distinctive matrix specific to each dataset, and an additive noise matrix. Existing decomposition methods claim that their common matrices capture the common pattern of the two datasets. However, this so-called common pattern reflects only the shared latent factors and ignores the common pattern between the two coefficient matrices of those factors. We propose a new unsupervised learning method, called the common and distinctive pattern analysis (CDPA), which appropriately defines the two types of data patterns by further incorporating the common and distinctive patterns of the coefficient matrices. A consistent estimation approach is developed for high-dimensional settings and shows reasonably good finite-sample performance in simulations. Our simulation studies and real-data analysis corroborate that the proposed CDPA can provide a better characterization of common and distinctive patterns and thereby benefit data mining.
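The additive decomposition described in the summary can be sketched with a toy simulation. All dimensions, the factor-model notation, and the use of a plain truncated SVD below are illustrative assumptions for exposition; this is not the CDPA estimator itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2, r_c, r_d = 100, 50, 60, 2, 3

# Shared latent factors (n x r_c) generate the common part of both datasets.
Z = rng.standard_normal((n, r_c))
W1 = rng.standard_normal((p1, r_c))  # coefficient (loading) matrix, dataset 1
W2 = rng.standard_normal((p2, r_c))  # coefficient (loading) matrix, dataset 2

# Dataset-specific low-rank matrices give the distinctive parts.
U1 = rng.standard_normal((n, r_d)) @ rng.standard_normal((r_d, p1))
U2 = rng.standard_normal((n, r_d)) @ rng.standard_normal((r_d, p2))

# Observed matrices: low-rank common + low-rank distinctive + additive noise.
Y1 = Z @ W1.T + U1 + 0.1 * rng.standard_normal((n, p1))
Y2 = Z @ W2.T + U2 + 0.1 * rng.standard_normal((n, p2))

def truncated_svd(Y, r):
    """Best rank-r approximation of Y in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

# A naive rank-(r_c + r_d) SVD recovers the total low-rank signal of each
# dataset, but it cannot separate common from distinctive variation; that
# separation is what methods such as JIVE, OnPLS, and CDPA address.
signal1 = truncated_svd(Y1, r_c + r_d)
print(np.linalg.norm(Y1 - signal1) / np.linalg.norm(Y1))  # small residual
```

The point of the sketch is that low-rank recovery alone is easy; the paper's contribution concerns how the recovered signal is split into common and distinctive patterns, including the patterns shared by the coefficient matrices W1 and W2.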


62-XX Statistics


bootstrap; OnPLS; JIVE
Full Text: DOI arXiv Link


[1] ANDREW, G., ARORA, R., BILMES, J. and LIVESCU, K. (2013). Deep canonical correlation analysis. In International Conference on Machine Learning 1247-1255.
[2] BARCH, D. M., BURGESS, G. C., HARMS, M. P., PETERSEN, S. E., SCHLAGGAR, B. L., CORBETTA, M., GLASSER, M. F., CURTISS, S., DIXIT, S., FELDT, C. et al. (2013). Function in the human connectome: task-fMRI and individual differences in behavior. Neuroimage 80 169-189.
[3] BJÖRCK, A. and GOLUB, G. H. (1973). Numerical methods for computing angles between linear subspaces. Mathematics of Computation 27 579-594. · Zbl 0282.65031
[4] BUCKNER, R. L., KRIENEN, F. M., CASTELLANOS, A., DIAZ, J. C. and THOMAS YEO, B. T. (2011). The organization of the human cerebellum estimated by intrinsic functional connectivity. Journal of Neurophysiology 106 2322-2345.
[5] CAMPBELL, J. D., YAU, C., BOWLBY, R., LIU, Y., BRENNAN, K., FAN, H., TAYLOR, A. M., WANG, C., WALTER, V., AKBANI, R. et al. (2018). Genomic, pathway network, and immunologic features distinguishing squamous carcinomas. Cell Reports 23 194-212.
[6] CARROLL, J. D. (1968). Generalization of canonical correlation analysis to three or more sets of variables. In Proc. Am. Psychol. Ass. 227-228.
[7] CHAMBERLAIN, G. and ROTHSCHILD, M. (1983). Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica 51 1281-1304. · Zbl 0523.90017 · doi:10.2307/1912275
[8] CRAWFORD, K. L., NEU, S. C. and TOGA, A. W. (2016). The image and data archive at the laboratory of neuro imaging. Neuroimage 124 1080-1083.
[9] DEZA, M. M. and DEZA, E. (2014). Distances on Numbers, Polynomials, and Matrices. In Encyclopedia of Distances 227-244. Springer.
[10] DICICCIO, C. J. and ROMANO, J. P. (2017). Robust permutation tests for correlation and regression coefficients. Journal of the American Statistical Association 112 1211-1220.
[11] EFRON, B. and TIBSHIRANI, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall. · Zbl 0835.62038
[12] FAN, J., LIAO, Y. and MINCHEVA, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B 75 603-680. · Zbl 1411.62138
[13] FENG, Q., JIANG, M., HANNIG, J. and MARRON, J. (2018). Angle-based joint and individual variation explained. Journal of Multivariate Analysis 166 241-265. · Zbl 1408.62113
[14] FUKUMIZU, K., BACH, F. R. and GRETTON, A. (2007). Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research 8 361-383. · Zbl 1222.62063
[15] GOWER, J. C. (1975). Generalized procrustes analysis. Psychometrika 40 33-51. · Zbl 0305.62038
[16] GOWER, J. C. and DIJKSTERHUIS, G. B. (2004). Procrustes Problems 30. Oxford University Press.
[17] HARMAN, H. H. (1976). Modern Factor Analysis, 3rd revised ed. University of Chicago Press.
[18] HOADLEY, K. A., YAU, C., HINOUE, T., WOLF, D. M., LAZAR, A. J., DRILL, E., SHEN, R., TAYLOR, A. M., CHERNIACK, A. D., THORSSON, V. et al. (2018). Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell 173 291-304.
[19] HORN, R. A. and JOHNSON, C. R. (1994). Topics in Matrix Analysis. Cambridge University Press, Cambridge. · Zbl 0801.15001
[20] HOTELLING, H. (1936). Relations between two sets of variates. Biometrika 28 321-377. · Zbl 0015.40705
[21] HUANG, H. (2017). Asymptotic behavior of support vector machine for spiked population model. Journal of Machine Learning Research 18 1-21. · Zbl 1437.62231
[22] HUBERT, L. and ARABIE, P. (1985). Comparing partitions. Journal of Classification 2 193-218.
[23] JENSEN, M. A., FERRETTI, V., GROSSMAN, R. L. and STAUDT, L. M. (2017). The NCI Genomic Data Commons as an engine for precision medicine. Blood 130 453-459.
[24] KETTENRING, J. R. (1971). Canonical analysis of several sets of variables. Biometrika 58 433-451. · Zbl 0225.62072 · doi:10.1093/biomet/58.3.433
[25] KISHORE KUMAR, N. and SCHNEIDER, J. (2017). Literature survey on low rank approximation of matrices. Linear and Multilinear Algebra 65 2212-2244. · Zbl 1387.65039
[26] KOBOLDT, D., FULTON, R., MCLELLAN, M., SCHMIDT, H., KALICKI-VEIZER, J., MCMICHAEL, J., FULTON, L., DOOLING, D., DING, L. et al. (2012). Comprehensive molecular portraits of human breast tumours. Nature 490 61-70.
[27] KOLTCHINSKII, V. and LOUNICI, K. (2017). Concentration inequalities and moment bounds for sample covariance operators. Bernoulli 23 110-133. · Zbl 1366.60057 · doi:10.3150/15-BEJ730
[28] LAM, C. and FAN, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics 37 4254-4278. · Zbl 1191.62101
[29] LOCK, E. and DUNSON, D. (2013). Bayesian consensus clustering. Bioinformatics 29 2610-2616.
[30] LOCK, E. F., HOADLEY, K. A., MARRON, J. S. and NOBEL, A. B. (2013). Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Annals of Applied Statistics 7 523-542. · Zbl 1454.62355
[31] LÖFSTEDT, T. and TRYGG, J. (2011). OnPLS-a novel multiblock method for the modelling of predictive and orthogonal variation. Journal of Chemometrics 25 441-455.
[32] LU, Y., HUANG, K. and LIU, C.-L. (2016). A fast projected fixed-point algorithm for large graph matching. Pattern Recognition 60 971-982. · Zbl 1414.68097
[33] MAI, Q. and ZHANG, X. (2019). An iterative penalized least squares approach to sparse canonical correlation analysis. Biometrics 75 734-744. · Zbl 1436.62598
[34] MANTEL, N. (1966). Evaluation of survival data and two new rank order statistics arising in its consideration. Cancer Chemotherapy Reports 50 163-170.
[35] MOAKHER, M. and BATCHELOR, P. G. (2006). Symmetric positive-definite matrices: From geometry to applications and visualization. In Visualization and Processing of Tensor Fields 285-298. Springer.
[36] NADAKUDITI, R. R. and SILVERSTEIN, J. W. (2010). Fundamental limit of sample generalized eigenvalue based detection of signals in noise using relatively few signal-bearing and noise-only samples. IEEE Journal of Selected Topics in Signal Processing 4 468-480.
[37] OLIVETTI, E., SHARMIN, N. and AVESANI, P. (2016). Alignment of tractograms as graph matching. Frontiers in Neuroscience 10 554.
[38] ONATSKI, A. (2010). Determining the number of factors from empirical distribution of eigenvalues. The Review of Economics and Statistics 92 1004-1016.
[39] PAPADIAS, C. B. (2000). Globally convergent blind source separation based on a multiuser kurtosis maximization criterion. IEEE Transactions on Signal Processing 48 3508-3519.
[40] PARKER, J. S., MULLINS, M., CHEANG, M. C., LEUNG, S., VODUC, D., VICKERY, T., DAVIES, S., FAURON, C., HE, X., HU, Z. et al. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology 27 1160-1167.
[41] PARRA, L. and SAJDA, P. (2003). Blind source separation via generalized eigenvalue decomposition. Journal of Machine Learning Research 4 1261-1269. · Zbl 1069.94513
[42] PETO, R. and PETO, J. (1972). Asymptotically efficient rank invariant test procedures. Journal of the Royal Statistical Society: Series A 135 185-198. · Zbl 0306.62012
[43] SAEED, U., COMPAGNONE, J., AVIV, R. I., STRAFELLA, A. P., BLACK, S. E., LANG, A. E. and MASELLIS, M. (2017). Imaging biomarkers in Parkinson’s disease and Parkinsonian syndromes: current and emerging concepts. Translational Neurodegeneration 6 8.
[44] SCHOUTEDEN, M., VAN DEUN, K., PATTYN, S. and VAN MECHELEN, I. (2013). SCA with rotation to distinguish common and distinctive information in linked data. Behavior Research Methods 45 822-833.
[45] SHU, H., WANG, X. and ZHU, H. (2020). D-CCA: A decomposition-based canonical correlation analysis for high-dimensional datasets. Journal of the American Statistical Association 115 292-306. · Zbl 1437.62211
[46] SMILDE, A. K., MÅGE, I., NÆS, T., HANKEMEIER, T., LIPS, M. A., KIERS, H. A. L., ACAR, E. and BRO, R. (2017). Common and distinct components in data fusion. Journal of Chemometrics 31 e2900.
[47] SMILDE, A. K., WESTERHUIS, J. A. and DE JONG, S. (2003). A framework for sequential multiblock component methods. Journal of Chemometrics 17 323-337.
[48] SONG, Y., SCHREIER, P. J., RAMÍREZ, D. and HASIJA, T. (2016). Canonical correlation analysis of high-dimensional data with very small sample support. Signal Processing 128 449-458.
[49] TENENHAUS, A. and TENENHAUS, M. (2011). Regularized generalized canonical correlation analysis. Psychometrika 76 257. · Zbl 1284.62753
[50] UDELL, M. and TOWNSEND, A. (2019). Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science 1 144-160. · Zbl 1513.68057
[51] VAN DER KLOET, F. M., SEBASTIÁN-LEÓN, P., CONESA, A., SMILDE, A. K. and WESTERHUIS, J. A. (2016). Separating common from distinctive variation. BMC Bioinformatics 17 S195.
[52] VAN ESSEN, D. C., SMITH, S. M., BARCH, D. M., BEHRENS, T. E., YACOUB, E., UGURBIL, K., WU-MINN HCP CONSORTIUM et al. (2013). The WU-Minn human connectome project: An overview. NeuroImage 80 62-79.
[53] WANG, W. and FAN, J. (2017). Asymptotics of empirical eigenstructure for high dimensional spiked covariance. The Annals of Statistics 45 1342-1374. · Zbl 1373.62299
[54] WARD, J. H. JR. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58 236-244.
[55] WEINER, M. W., VEITCH, D. P., AISEN, P. S., BECKETT, L. A., CAIRNS, N. J., GREEN, R. C., HARVEY, D., JACK, C. R., JAGUST, W., LIU, E. et al. (2013). The Alzheimer’s Disease Neuroimaging Initiative: a review of papers published since its inception. Alzheimer’s & Dementia 9 e111-e194.
[56] YIN, Y.-Q., BAI, Z.-D. and KRISHNAIAH, P. R. (1988). On the limit of the largest eigenvalue of the large dimensional sample covariance matrix. Probability Theory and Related Fields 78 509-521. · Zbl 0627.62022
[57] YU, Y., WANG, T. and SAMWORTH, R. J. (2015). A useful variant of the Davis-Kahan theorem for statisticians. Biometrika 102 315-323. · Zbl 1452.15010 · doi:10.1093/biomet/asv008
[58] ZHOU, G., CICHOCKI, A., ZHANG, Y. and MANDIC, D. P. (2016). Group component analysis for multiblock data: Common and individual feature extraction. IEEE Trans. Neural Netw. Learn. Syst. 27 2426-2439.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible, without claiming completeness or a perfect matching.