A hierarchical Bayesian model for inference of copy number variants and their association to gene expression. (English) Zbl 1454.62315

Summary: A number of statistical models have been successfully developed for the analysis of high-throughput data from a single source, but few methods are available for integrating data from different sources. Here we focus on integrating gene expression levels with comparative genomic hybridization (CGH) array measurements collected on the same subjects. We specify a measurement error model that relates the gene expression levels to latent copy number states which, in turn, are related to the observed surrogate CGH measurements via a hidden Markov model. We employ selection priors that exploit the dependencies across adjacent copy number states and investigate MCMC stochastic search techniques for posterior inference. Our approach results in a unified modeling framework for simultaneously inferring copy number variants (CNVs) and identifying their significant associations with mRNA transcript abundance. We show performance on simulated data and illustrate an application to data from a genomic study on human cancer cell lines.
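
The layered structure described in the summary (latent copy number states following a hidden Markov chain, CGH measurements as noisy surrogates of those states, and gene expression regressed on the latent states with sparse inclusion indicators) can be illustrated with a small forward simulation. The sketch below is only an illustration under stated assumptions and is not the authors' specification: the four-state coding, the state means, the transition rule, the inclusion probability, the noise levels, and the function name `simulate` are all hypothetical choices made for this example. It generates data and does not perform the MCMC stochastic search.

```python
import random


def simulate(n_samples=5, n_probes=30, n_genes=4, stay_prob=0.9, seed=0):
    """Forward-simulate a toy version of the three-layer model (all values assumed)."""
    rng = random.Random(seed)

    # Assumed state coding: 1 = loss, 2 = neutral, 3 = gain, 4 = amplification.
    states = [1, 2, 3, 4]
    # Assumed mean CGH log-ratio emitted by each latent state.
    state_means = {1: -0.65, 2: 0.0, 3: 0.65, 4: 1.5}

    def next_state(current):
        # Hypothetical "sticky" transition: stay with probability stay_prob,
        # otherwise jump uniformly to one of the other states.
        if rng.random() < stay_prob:
            return current
        return rng.choice([s for s in states if s != current])

    # Layer 1: latent copy number states, one Markov chain per sample.
    xi = []
    for _ in range(n_samples):
        chain = [2]  # assumption: every chain starts in the neutral state
        for _ in range(n_probes - 1):
            chain.append(next_state(chain[-1]))
        xi.append(chain)

    # Layer 2: observed CGH measurements, Gaussian noise around the state mean.
    x = [[rng.gauss(state_means[s], 0.1) for s in chain] for chain in xi]

    # Sparse inclusion indicators r[g][m]: is gene g associated with probe m?
    r = [[1 if rng.random() < 0.05 else 0 for _ in range(n_probes)]
         for _ in range(n_genes)]
    # Regression coefficients are exactly zero wherever the indicator is zero.
    beta = [[rng.gauss(0.0, 1.0) * r[g][m] for m in range(n_probes)]
            for g in range(n_genes)]

    # Layer 3: gene expression depends on the latent states, not on the noisy CGH.
    y = [[sum(beta[g][m] * xi[i][m] for m in range(n_probes)) + rng.gauss(0.0, 0.5)
          for g in range(n_genes)]
         for i in range(n_samples)]

    return xi, x, r, beta, y
```

In this toy setup the expression values depend on the CGH array only through the latent states, which is what makes the CGH measurements surrogates in a measurement error sense; posterior inference would have to recover both the states and the indicators from `x` and `y` alone.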


62P10 Applications of statistics to biology and medical sciences; meta analysis
62F15 Bayesian inference

