Selection-corrected statistical inference for region detection with high-throughput assays. (English) Zbl 1428.62208

Summary: Scientists use high-dimensional measurement assays to detect and prioritize regions of strong signal in a spatially organized domain. Examples include finding methylation-enriched genomic regions using microarrays, and active cortical areas using brain imaging. The most common procedure for detecting potential regions is to group neighboring sites where the signal passes a threshold. However, one needs to account for the selection bias induced by this procedure; otherwise the estimated effects diminish when generalized to a population. This article introduces pin-down inference, a model and an inference framework that permit population inference for these detected regions. Pin-down inference provides nonasymptotic point and confidence interval estimators for the mean effect in the region that account for local selection bias. Our estimators accommodate the nonstationary covariances that are typical of these data, allowing researchers to better compare regions of different sizes and correlation structures. Inference is provided within a conditional one-parameter exponential family per region, with truncations that match the selection constraints. A secondary screening-and-adjustment step allows pruning the set of detected regions while controlling the false-coverage rate over the reported regions. We apply the method to genomic regions with differing DNA-methylation rates across tissues. Our method provides superior power compared to other conditional and nonparametric approaches.
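The detection step described in the summary (grouping neighboring sites where the signal passes a threshold) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `detect_regions`, the toy signal `z`, and the threshold value are all assumptions chosen for illustration. The sketch only shows the selection step and the naive per-region mean, which is the quantity the summary says is affected by selection bias; it does not implement pin-down inference itself.

```python
import numpy as np

def detect_regions(z, threshold):
    """Return (start, end) index pairs, end exclusive, of maximal runs
    of neighboring sites whose signal exceeds the threshold."""
    above = np.asarray(z) > threshold
    # Pad with False so that every run has a rising and a falling edge.
    padded = np.concatenate(([False], above, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)   # first site of each run
    ends = np.flatnonzero(edges == -1)    # one past the last site of each run
    return list(zip(starts.tolist(), ends.tolist()))

# Hypothetical per-site statistics along a one-dimensional domain.
z = np.array([0.1, 2.3, 2.8, 0.4, -0.2, 3.1, 0.0])
regions = detect_regions(z, threshold=2.0)

# Naive region means, computed on the same data used for selection.
# Because every site in a region was required to exceed the threshold,
# these means are not corrected for selection.
naive_means = [float(z[s:e].mean()) for s, e in regions]
```

In this toy input, sites 1 to 2 and site 5 exceed the threshold, so two regions are reported. A selection-corrected analysis would replace the naive means with estimators that condition on the event that all sites in the region passed the threshold.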

MSC:

62H11 Directional data; spatial statistics
62M30 Inference from spatial processes
