
A general framework for association analysis of heterogeneous data. (English) Zbl 1405.62068

Summary: Multivariate association analysis is of primary interest in many applications. Despite the prevalence of high-dimensional and non-Gaussian data (such as count-valued or binary), most existing methods only apply to low-dimensional data with continuous measurements. Motivated by the Computer Audition Lab 500-song (CAL500) music annotation study, we develop a new framework for the association analysis of two sets of high-dimensional and heterogeneous (continuous/binary/count) data. We model heterogeneous random variables using exponential family distributions, and exploit a structured decomposition of the underlying natural parameter matrices to identify shared and individual patterns for two data sets. We also introduce a new measure of the strength of association, and a permutation-based procedure to test its significance. An alternating iteratively reweighted least squares algorithm is devised for model fitting, and several variants are developed to expedite computation and achieve variable selection. The application to the CAL500 data sheds light on the relationship between acoustic features and semantic annotations, and provides effective means for automatic music annotation and retrieval.
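For orientation, the following is a minimal LaTeX sketch of the kind of structured decomposition the summary describes; the notation (Theta_k, U_0, U_k, V_k, A_k) is illustrative and not necessarily the authors': each data matrix X_k (k = 1, 2) is modeled through an exponential family whose natural parameter matrix splits into an intercept term, a component driven by scores shared across the two data sets, and a component driven by data-set-specific scores.

% Illustrative decomposition (hypothetical notation): Theta_k collects the
% natural parameters for data set X_k; U_0 holds scores shared by both data
% sets, while U_k holds scores specific to data set k.
\[
  X_k \sim \mathrm{EF}(\Theta_k), \qquad
  \Theta_k = \mathbf{1}\mu_k^{\top} + U_0 V_k^{\top} + U_k A_k^{\top},
  \qquad k = 1, 2.
\]
% In JIVE-type models, identifiability of the shared and individual parts is
% typically enforced by orthogonality of the score spaces, e.g.
% U_0^{\top} U_k = 0 for k = 1, 2; the exact constraints used here are
% specified in the paper.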

MSC:

62H20 Measures of association (correlation, canonical correlation, etc.)
62P35 Applications of statistics to physics
62J12 Generalized linear models (logistic models)

Software:

PMA; JIVE; MULAN

References:

[1] Barrington, L., Chan, A., Turnbull, D. and Lanckriet, G. (2007). Audio information retrieval using semantic similarity. In International Conference on Acoustics, Speech and Signal Processing2 725–728. IEEE, New York.
[2] Bertin-Mahieux, T., Eck, D., Maillet, F. and Lamere, P. (2008). Autotagger: A model for predicting social tags from acoustic features on large music databases. J. New Music Res.37 115–135.
[3] Björck, Å. and Golub, G. H. (1973). Numerical methods for computing angles between linear subspaces. Math. Comp.27 579–594. · Zbl 0282.65031 · doi:10.2307/2005662
[4] Browne, M. W. (1979). The maximum-likelihood solution in inter-battery factor analysis. Br. J. Math. Stat. Psychol.32 75–86. · Zbl 0404.62079 · doi:10.1111/j.2044-8317.1979.tb00753.x
[5] Chaudhuri, K., Kakade, S. M., Livescu, K. and Sridharan, K. (2009). Multi-view clustering via canonical correlation analysis. In Proceedings of the 26th Annual International Conference on Machine Learning 129–136. ACM, New York.
[6] Chen, X. and Liu, H. (2012). An efficient optimization algorithm for structured sparse CCA, with applications to eQTL mapping. Stat. Biosci.4 3–26.
[7] Chen, M., Gao, C., Ren, Z. and Zhou, H. H. (2013). Sparse CCA via precision adjusted iterative thresholding. ArXiv preprint. Available at arXiv:1311.6186.
[8] Cheng, J., Li, T., Levina, E. and Zhu, J. (2017). High-dimensional mixed graphical models. J. Comput. Graph. Statist.26 367–378.
[9] Collins, M., Dasgupta, S. and Schapire, R. E. (2001). A generalization of principal components analysis to the exponential family. In NIPS’01: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic 617–624. MIT Press, Cambridge, MA.
[10] Ellis, D. P., Whitman, B., Berenzweig, A. and Lawrence, S. (2002). The quest for ground truth in musical artist similarity. In ISMIR 2002 Conference Proceedings: Third International Conference on Music Information Retrieval: October 13–17, 2002, IRCAM-Centre Pompidou, Paris, France.
[11] Goldsmith, J., Zipunnikov, V. and Schrack, J. (2015). Generalized multilevel function-on-scalar regression and principal component analysis. Biometrics71 344–353. · Zbl 1390.62259 · doi:10.1111/biom.12278
[12] Golub, G. H. and Van Loan, C. F. (2013). Matrix Computations, 4th ed. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins Univ. Press, Baltimore, MD. · Zbl 1268.65037
[13] Goto, M. and Hirata, K. (2004). Recent studies on music information processing. Acoust. Sci. Technol.25 419–425.
[14] Hastie, T., Tibshirani, R. and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton, FL. · Zbl 1319.68003
[15] Herlocker, J. L., Konstan, J. A. and Riedl, J. (2000). Explaining collaborative filtering recommendations. In Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work 241–250. ACM, New York.
[16] Hotelling, H. (1936). Relations between two sets of variates. Biometrika28 321–377. · Zbl 0015.40705 · doi:10.1093/biomet/28.3-4.321
[17] Jia, Y., Salzmann, M. and Darrell, T. (2010). Factorized latent spaces with structured sparsity. Adv. Neural Inf. Process. Syst. 982–990.
[18] Johnson, N. L., Kotz, S. and Balakrishnan, N. (1997). Discrete Multivariate Distributions165. Wiley, New York. · Zbl 0868.62048
[19] Klami, A., Virtanen, S. and Kaski, S. (2010). Bayesian exponential family projections for coupled data sources. In The Twenty-Sixth Conference on Uncertainty in Artificial Intelligence 286–293. AUAI Press.
[20] Klami, A., Virtanen, S. and Kaski, S. (2013). Bayesian canonical correlation analysis. J. Mach. Learn. Res.14 965–1003. · Zbl 1320.62134
[21] Li, G. and Gaynanova, I. (2018). Supplement to “A general framework for association analysis of heterogeneous data.” DOI:10.1214/17-AOAS1127SUPP.
[22] Li, Q., Cheng, G., Fan, J. and Wang, Y. (2018). Embracing the blessing of dimensionality in factor models. J. Amer. Statist. Assoc.113 380–389. · Zbl 1398.62137
[23] Lock, E. F., Hoadley, K. A., Marron, J. S. and Nobel, A. B. (2013). Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann. Appl. Stat.7 523–542. · Zbl 1454.62355 · doi:10.1214/12-AOAS597
[24] Logan, B. (2000). Mel frequency cepstral coefficients for music modeling. In International Symposium on Music Information Retrieval (ISMIR).
[25] Luo, C., Liu, J., Dey, D. K. and Chen, K. (2016). Canonical variate regression. Biostatistics17 468–483.
[26] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman & Hall, London. [Second edition of MR0727836.] · Zbl 0744.62098
[27] She, Y. (2013). Reduced rank vector generalized linear models for feature extraction. Stat. Interface6 197–209. · Zbl 1327.62431 · doi:10.4310/SII.2013.v6.n2.a4
[28] Trygg, J. and Wold, S. (2003). O2–PLS, a two-block (X–Y) latent variable regression (LVR) method with an integral OSC filter. J. Chemom.17 53–64.
[29] Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J. and Vlahavas, I. (2011). Mulan: A Java library for multi-label learning. J. Mach. Learn. Res.12 2411–2414. · Zbl 1280.68207
[30] Tucker, L. R. (1958). An inter-battery method of factor analysis. Psychometrika23 111–136. · Zbl 0097.35102 · doi:10.1007/BF02289009
[31] Turnbull, D., Barrington, L., Torres, D. and Lanckriet, G. (2007). Towards musical query-by-semantic-description using the CAL500 data set. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 439–446. ACM, New York.
[32] Turnbull, D., Barrington, L., Torres, D. and Lanckriet, G. (2008). Semantic annotation and retrieval of music and sound effects. IEEE/ACM Trans. Audio Speech Lang. Process.16 467–476.
[33] Virtanen, S., Klami, A. and Kaski, S. (2011). Bayesian CCA via group sparsity. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011) 457–464. ACM, New York.
[34] Westerhuis, J. A., Kourti, T. and MacGregor, J. F. (1998). Analysis of multiblock and hierarchical PCA and PLS models. J. Chemom.12 301–321.
[35] Witten, D. M., Tibshirani, R. and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics10 513–534. · Zbl 1437.62658
[36] Yang, D., Ma, Z. and Buja, A. (2014). A sparse singular value decomposition method for high-dimensional data. J. Comput. Graph. Statist.23 923–942. · doi:10.1080/10618600.2013.858632
[37] Yang, Z., Ning, Y. and Liu, H. (2014). On semiparametric exponential family graphical models. ArXiv preprint. Available at arXiv:1412.8697.
[38] Zhou, G., Cichocki, A., Zhang, Y. and Mandic, D. P. (2016a). Group component analysis for multiblock data: Common and individual feature extraction. IEEE Trans. Neural Netw. Learn. Syst.27 2426–2439.
[39] Zhou, G., Zhao, Q., Zhang, Y., Adali, T., Xie, S. and Cichocki, A. (2016b). Linked component analysis from matrices to high-order tensors: Applications to biomedical data. Proc. IEEE104 310–331.
[40] Zoh, R. S., Mallick, B., Ivanov, I., Baladandayuthapani, V., Manyam, G., Chapkin, R. S., Lampe, J. W. and Carroll, R. J. (2016). PCAN: Probabilistic correlation analysis of two non-normal data sets. Biometrics72 1358–1368. · Zbl 1390.62325 · doi:10.1111/biom.12516