Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. (English) Zbl 1454.62355

Summary: Research in several fields now requires the analysis of data sets in which multiple high-dimensional types of data are available for a common set of objects. In particular, The Cancer Genome Atlas (TCGA) includes data from several diverse genomic technologies on the same cancerous tumor samples. In this paper we introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such data sets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across data types, low-rank approximations for structured variation individual to each data type, and residual noise. JIVE quantifies the amount of joint variation between data types, reduces the dimensionality of the data and provides new directions for the visual exploration of joint and individual structures. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. A JIVE analysis of gene expression and miRNA data on Glioblastoma Multiforme tumor samples reveals gene-miRNA associations and provides better characterization of tumor types.
Data and software are available at https://genome.unc.edu/jive/.
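
The three-term decomposition described in the summary can be illustrated with a short NumPy sketch. The summary does not state the estimation procedure, so everything below is an assumption made for illustration: the ranks are taken as known, the joint and individual terms are fitted by alternating truncated SVDs, and each individual term is forced to have rows orthogonal to the joint row space. The names `low_rank`, `jive_sketch`, `r_joint`, `r_indiv` and `n_iter` are hypothetical; this is not the software distributed by the authors.

```python
import numpy as np


def low_rank(M, r):
    """Best rank-r approximation of M in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]


def jive_sketch(blocks, r_joint, r_indiv, n_iter=50):
    """Illustrative joint + individual + residual decomposition.

    blocks  : list of (p_i x n) arrays, one per data type, sharing the same
              n columns (objects). Rows are assumed to be centered already.
    r_joint : assumed rank of the joint term (not estimated here).
    r_indiv : list of assumed ranks, one per block.
    n_iter  : number of alternating updates (must be at least 1).
    Returns (J_blocks, A_blocks); the residual of block i is X_i - J_i - A_i.
    """
    blocks = [np.asarray(B, dtype=float) for B in blocks]
    # Row offsets used to cut the stacked joint term back into blocks.
    splits = np.cumsum([B.shape[0] for B in blocks])[:-1]
    X = np.vstack(blocks)
    A = [np.zeros_like(B) for B in blocks]

    for _ in range(n_iter):
        # Joint step: low-rank fit to the stacked data minus individual terms.
        J = low_rank(X - np.vstack(A), r_joint)
        # Orthonormal basis (n x r_joint) for the row space of the joint term.
        _, _, Vt = np.linalg.svd(J, full_matrices=False)
        V = Vt[:r_joint].T
        J_blocks = np.split(J, splits, axis=0)

        # Individual step: per block, remove the joint term, project the
        # remainder away from the joint row space, then truncate its rank.
        A = []
        for Xi, Ji, ri in zip(blocks, J_blocks, r_indiv):
            R = Xi - Ji
            R = R - (R @ V) @ V.T
            A.append(low_rank(R, ri))

    return J_blocks, A
```

Called as `J, A = jive_sketch([X1, X2], 1, [1, 1])`, the function returns one joint and one individual matrix per data type, each with the same shape as the corresponding input block. The share of a block's variation attributed to the joint term can then be computed as `np.sum(J[0]**2) / np.sum(X1**2)`, assuming `X1` has been row-centered.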

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62H25 Factor analysis and principal components; correspondence analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62-08 Computational methods for problems pertaining to statistics
