Wavelet-based genetic association analysis of functional phenotypes arising from high-throughput sequencing assays. (English) Zbl 1397.62473

Summary: Understanding how genetic variants influence cellular-level processes is an important step toward understanding how they influence important organismal-level traits, or “phenotypes,” including human disease susceptibility. To this end, scientists are undertaking large-scale genetic association studies that aim to identify genetic variants associated with molecular and cellular phenotypes, such as gene expression, transcription factor binding, or chromatin accessibility. These studies use high-throughput sequencing assays (e.g., RNA-seq, ChIP-seq, DNase-seq) to obtain high-resolution data on how the traits vary along the genome in each sample. However, typical association analyses fail to exploit these high-resolution measurements, instead aggregating the data at coarser resolutions, such as genes, or windows of fixed length. Here we develop and apply statistical methods that better exploit the high-resolution data. The key idea is to treat the sequence data as measuring an underlying “function” that varies along the genome, and then, building on wavelet-based methods for functional data analysis, test for association between genetic variants and the underlying function. Applying these methods to identify genetic variants associated with chromatin accessibility (dsQTLs), we find that they identify substantially more associations than a simpler window-based analysis, and in total we identify 772 novel dsQTLs not identified by the original analysis.
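The key idea in the summary (treat the per-base read counts as a function along the genome, move to a wavelet representation, then test for association with a genetic variant) can be illustrated with a small sketch. This is not the authors' procedure: it assumes a plain Haar transform, swaps in a per-coefficient squared correlation with genotype as the test statistic, and calibrates it by permutation. The function names `haar_dwt`, `association_stat` and `permutation_pvalue` are hypothetical and exist only for this illustration.

```python
import numpy as np


def haar_dwt(x):
    """Orthonormal Haar transform of a vector whose length is a power of two.

    Returns the coefficients ordered as
    [overall scaling coefficient, coarsest detail, ..., finest details].
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    if n < 2 or n & (n - 1):
        raise ValueError("length must be a power of two (>= 2)")
    pieces = []
    approx = x
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        pieces.append((even - odd) / np.sqrt(2.0))  # detail coefficients at this scale
        approx = (even + odd) / np.sqrt(2.0)        # smoothed signal for the next scale
    pieces.append(approx)                           # single overall scaling coefficient
    return np.concatenate(pieces[::-1])


def association_stat(W, g):
    """Sum over wavelet coefficients of the squared correlation with genotype.

    W has one row per individual and one column per wavelet coefficient;
    g holds one genotype value per individual (e.g. 0, 1 or 2).
    """
    gc = g - g.mean()
    Wc = W - W.mean(axis=0)
    num = gc @ Wc
    den = np.sqrt((gc @ gc) * (Wc ** 2).sum(axis=0))
    # Coefficients (or genotypes) with zero variance contribute nothing.
    r = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float((r ** 2).sum())


def permutation_pvalue(profiles, genotypes, n_perm=999, seed=0):
    """Permutation p-value for association between genotype and count profiles."""
    rng = np.random.default_rng(seed)
    W = np.array([haar_dwt(p) for p in profiles])
    g = np.asarray(genotypes, dtype=float)
    observed = association_stat(W, g)
    # Shuffling genotypes breaks any genotype-profile link but keeps
    # the correlation structure among the wavelet coefficients.
    exceed = sum(
        association_stat(W, rng.permutation(g)) >= observed for _ in range(n_perm)
    )
    return (exceed + 1) / (n_perm + 1)
```

Because every scale contributes to the statistic, a genotype effect confined to a few bases and one spread over the whole region can both be picked up, which is the contrast with a single fixed-window count that the summary draws.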

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
42C40 Nontrigonometric harmonic analysis involving wavelets and other special systems
92D20 Protein sequences, DNA sequences
62F15 Bayesian inference

Software:

qvalue; ChIP-PaM; WaveSeq
