Three-way clustering of multi-tissue multi-individual gene expression data using semi-nonnegative tensor decomposition. (English) Zbl 1423.62152

Summary: The advent of high-throughput sequencing technologies has led to an increasing availability of large multi-tissue data sets which contain gene expression measurements across different tissues and individuals. In this setting, variation in expression levels arises due to contributions specific to genes, tissues, individuals, and interactions thereof. Classical clustering methods are ill-suited to explore these three-way interactions and struggle to fully extract the insights into transcriptome complexity contained in the data. We propose a new statistical method, called MultiCluster, based on semi-nonnegative tensor decomposition which permits the investigation of transcriptome variation across individuals and tissues simultaneously. We further develop a tensor projection procedure which detects covariate-related genes with high power, demonstrating the advantage of tensor-based methods in incorporating information across similar tissues. Through simulation and application to the GTEx RNA-seq data from 53 human tissues, we show that MultiCluster identifies three-way interactions with high accuracy and robustness.
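
To make the idea of a semi-nonnegative tensor decomposition concrete, the following is a minimal sketch and not the MultiCluster implementation: it fits a single rank-one component to a three-way array by alternating power-type updates, projecting one mode onto the nonnegative orthant while leaving the other two modes unconstrained. The function name `rank_one_seminonneg`, the choice of which mode is constrained, and the iteration count are all assumptions made for illustration; the summary does not specify the algorithm.

```python
import numpy as np

def rank_one_seminonneg(T, n_iter=100, seed=0):
    """Illustrative rank-one fit T ~ lam * (a o b o c) with b >= 0.

    Hypothetical helper: modes 0 and 2 are unconstrained, mode 1 is
    kept nonnegative. This is a generic alternating scheme, not the
    procedure from the paper.
    """
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    a = rng.standard_normal(I)
    a /= np.linalg.norm(a)
    b = np.abs(rng.standard_normal(J))  # start inside the nonnegative orthant
    b /= np.linalg.norm(b)
    c = rng.standard_normal(K)
    c /= np.linalg.norm(c)

    for _ in range(n_iter):
        # Unconstrained update of mode 0: contract T with b and c.
        a = np.einsum('ijk,j,k->i', T, b, c)
        a /= np.linalg.norm(a)

        # Constrained update of mode 1: contract, then clip at zero.
        raw = np.einsum('ijk,i,k->j', T, a, c)
        if raw.max() <= 0:
            # Flipping the sign of a flips the sign of the contraction,
            # so a nonnegative direction remains available.
            a, raw = -a, -raw
        b = np.maximum(raw, 0.0)
        b /= np.linalg.norm(b)

        # Unconstrained update of mode 2.
        c = np.einsum('ijk,i,j->k', T, a, b)
        c /= np.linalg.norm(c)

    # Scale of the component: inner product of T with the rank-one term.
    lam = float(np.einsum('ijk,i,j,k->', T, a, b, c))
    return lam, a, b, c

# Small synthetic check on a noiseless rank-one tensor.
rng = np.random.default_rng(1)
a0 = rng.standard_normal(6)
b0 = np.abs(rng.standard_normal(4))
c0 = rng.standard_normal(5)
T_demo = np.einsum('i,j,k->ijk', a0, b0, c0)
lam, a, b, c = rank_one_seminonneg(T_demo)
T_hat = lam * np.einsum('i,j,k->ijk', a, b, c)
print(np.allclose(T_hat, T_demo), bool(np.all(b >= 0)))
```

In a clustering setting, the large-magnitude entries of each fitted vector would indicate which genes, tissues, or individuals participate in the component; further components could be obtained by repeating the fit on the residual, though that deflation step is likewise an assumption rather than something stated in the summary.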

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
92D20 Protein sequences, DNA sequences
