Lee, Juhee; Müller, Peter; Gulukota, Kamalakar; Ji, Yuan
A Bayesian feature allocation model for tumor heterogeneity. (English) Zbl 1397.62457
Ann. Appl. Stat. 9, No. 2, 621-639 (2015).

Summary: We develop a feature allocation model for inference on genetic tumor variation using next-generation sequencing data. Specifically, we record single nucleotide variants (SNVs) based on short reads mapped to human reference genome and characterize tumor heterogeneity by latent haplotypes defined as a scaffold of SNVs on the same homologous genome. For multiple samples from a single tumor, assuming that each sample is composed of some sample-specific proportions of these haplotypes, we then fit the observed variant allele fractions of SNVs for each sample and estimate the proportions of haplotypes. Varying proportions of haplotypes across samples is evidence of tumor heterogeneity since it implies varying composition of cell subpopulations. Taking a Bayesian perspective, we proceed with a prior probability model for all relevant unknown quantities, including, in particular, a prior probability model on the binary indicators that characterize the latent haplotypes. Such prior models are known as feature allocation models. Specifically, we define a simplified version of the Indian buffet process, one of the most traditional feature allocation models. The proposed model allows overlapping clustering of SNVs in defining latent haplotypes, which reflects the evolutionary process of subclonal expansion in tumor samples.

Cited in 8 Documents

MSC:
62P10 Applications of statistics to biology and medical sciences; meta analysis
62F15 Bayesian inference

Keywords: haplotypes; feature allocation models; Indian buffet process; Markov chain Monte Carlo; next-generation sequencing; random binary matrices; variant calling

Software: BWA; BM-map; Samtools; GATK; PurBayes; PurityEst; KEGG; PyClone
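
The generative idea in the summary (binary indicators assigning SNVs to latent haplotypes, sample-specific haplotype proportions, and variant allele fractions that arise as mixtures) can be illustrated with a small forward simulation. This is only a sketch under stated assumptions, not the authors' model: the summary does not specify the likelihood or the exact prior, so the finite beta-Bernoulli prior (a common finite stand-in for the Indian buffet process), the Dirichlet weights, the binomial read counts, and every name below (`simulate_vaf`, `alpha`, `depth`, and so on) are hypothetical choices made for illustration.

```python
import random


def simulate_vaf(num_snvs=8, num_haplotypes=3, num_samples=4,
                 depth=200, alpha=2.0, seed=0):
    """Simulate variant read counts from a toy feature allocation model.

    All modelling choices here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Assumed prior on the binary matrix Z (SNVs x haplotypes): each
    # haplotype c has an inclusion probability pi_c ~ Beta(alpha / C, 1),
    # and each SNV joins haplotype c independently with probability pi_c.
    # An SNV may belong to several haplotypes (overlapping clustering).
    pi = [rng.betavariate(alpha / num_haplotypes, 1.0)
          for _ in range(num_haplotypes)]
    z = [[1 if rng.random() < pi[c] else 0 for c in range(num_haplotypes)]
         for _ in range(num_snvs)]

    # Assumed sample-specific haplotype proportions: Dirichlet(1, ..., 1),
    # drawn by normalizing independent Gamma(1, 1) variates.
    w = []
    for _ in range(num_samples):
        g = [rng.gammavariate(1.0, 1.0) for _ in range(num_haplotypes)]
        total = sum(g)
        w.append([x / total for x in g])

    # Expected variant allele fraction of SNV s in sample t: the total
    # proportion of haplotypes in sample t that carry SNV s.
    p = [[sum(w[t][c] * z[s][c] for c in range(num_haplotypes))
          for t in range(num_samples)]
         for s in range(num_snvs)]

    # Assumed observation model: variant read counts are binomial with a
    # fixed total depth, simulated as a sum of Bernoulli trials.
    n = [[sum(1 for _ in range(depth) if rng.random() < p[s][t])
          for t in range(num_samples)]
         for s in range(num_snvs)]
    return z, w, p, n


if __name__ == "__main__":
    z, w, p, n = simulate_vaf()
    for s, row in enumerate(n):
        print(f"SNV {s}: haplotypes={z[s]} variant reads per sample={row}")
```

Rows of `p` that differ across samples correspond to the heterogeneity signal described in the summary: the same SNV shows different allele fractions because the haplotype proportions in `w` differ from sample to sample. Posterior inference (for example by Markov chain Monte Carlo, as listed in the keywords) is not attempted here.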