zbMATH — the first resource for mathematics

Subsampling methods for genomic inference. (English) Zbl 1220.62130
Summary: Large-scale statistical analysis of data sets associated with genome sequences plays an important role in modern biology. A key component of such statistical analyses is the computation of \(p\)-values and confidence bounds for statistics defined on the genome. Currently such computation is commonly achieved through ad hoc simulation measures. The method of randomization, which is at the heart of these simulation procedures, can significantly affect the resulting statistical conclusions. Most simulation schemes introduce a variety of hidden assumptions regarding the nature of the randomness in the data, resulting in a failure to capture biologically meaningful relationships. To address the need for a method of assessing the significance of observations within large-scale genomic studies, where there often exists a complex dependency structure between observations, we propose a unified solution built upon a data subsampling approach. We propose a piecewise stationary model for genome sequences and show that the subsampling approach gives correct answers under this model. We illustrate the method on three simulation studies and two real data examples.
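
As a rough illustration of the kind of data subsampling the summary refers to, the following minimal sketch computes a confidence interval for the mean of a dependent sequence from contiguous blocks, in the spirit of the stationary-case subsampling of Politis and Romano cited in the reference list. It is not the authors' piecewise stationary procedure. The function names, the AR(1) toy data, and the block length of 100 are assumptions made only for this example.

```python
import math
import random


def block_subsample_ci(x, block_len, alpha=0.05):
    """Subsampling confidence interval for the mean of a dependent sequence.

    Hypothetical helper: assumes x is (roughly) stationary and that the
    mean converges at rate sqrt(n).
    """
    n = len(x)
    if not 1 <= block_len < n:
        raise ValueError("block_len must satisfy 1 <= block_len < len(x)")
    full_mean = sum(x) / n

    # Means of all n - block_len + 1 contiguous blocks, via a running sum.
    # Each root is sqrt(b) * (block mean - full-sample mean).
    window = sum(x[:block_len])
    roots = [math.sqrt(block_len) * (window / block_len - full_mean)]
    for i in range(block_len, n):
        window += x[i] - x[i - block_len]
        roots.append(math.sqrt(block_len) * (window / block_len - full_mean))
    roots.sort()

    def quantile(p):
        # Empirical quantile of the subsample roots.
        idx = min(len(roots) - 1, max(0, math.ceil(p * len(roots)) - 1))
        return roots[idx]

    # Invert the approximated law of sqrt(n) * (estimate - truth).
    lower = full_mean - quantile(1 - alpha / 2) / math.sqrt(n)
    upper = full_mean - quantile(alpha / 2) / math.sqrt(n)
    return lower, upper


def simulate_ar1(n, phi, seed):
    """Toy dependent data: a Gaussian AR(1) sequence (assumption for the demo)."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        x.append(prev)
    return x


x = simulate_ar1(n=2000, phi=0.6, seed=0)
ci_low, ci_high = block_subsample_ci(x, block_len=100)
print(ci_low, ci_high)
```

Because the blocks are contiguous, each subsample keeps the local dependence of the original sequence, which is what a scheme that shuffles individual positions would destroy. The choice of block length matters in practice and is not addressed by this sketch.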

62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology
65C60 Computational problems in statistics (MSC2010)
92D10 Genetics and epigenetics
CATS; CisModule
[1] Andrews, D. and Mallows, C. (1974). Scale mixtures of normal distributions. J. Roy. Statist. Soc. Ser. B 36 99-102. · Zbl 0282.62017
[2] Beran, R. (1988). Prepivoting test statistics: A bootstrap view of asymptotic refinements. J. Amer. Statist. Assoc. 83 687-697. · Zbl 0662.62024 · doi:10.2307/2289292
[3] Bernardi, G., Olofsson, B., Filipski, J., Zerial, M., Salinas, J., Cuny, G., Meunier-Rotival, M. and Rodier, F. (1985). The mosaic genome of warm-blooded vertebrates. Science 228 953-958.
[4] Bickel, P. J., Boley, N., Brown, J. B., Huang, H. and Zhang, N. R. (2010). Supplement to “Subsampling methods for genomic inference.” · Zbl 1220.62130 · doi:10.1214/10-AOAS363
[5] Bickel, P. J. and Sakov, A. (2008). On the choice of m in the m out of n bootstrap and its application to confidence bounds for extreme percentiles. Statist. Sinica 18 967-985. · Zbl 05361940
[6] Bickel, P. J., Götze, F. and van Zwet, W. R. (1997). Resampling fewer than n observations: Gains, losses, and remedies for losses. Statist. Sinica 7 1-31. · Zbl 0927.62043
[7] Birney, E. et al. (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447 799-816.
[8] Blakesley, R. W. et al. (2004). An intermediate grade of finished genomic sequence suitable for comparative analyses. Genome Res. 14 2235-2244.
[9] Braun, J. and Müller, H.-G. (1998). Statistical methods for DNA sequence segmentation. Statist. Sci. 13 142-162. · Zbl 0960.62121 · doi:10.1214/ss/1028905933
[10] Carter, N. (2007). Methods and strategies for analyzing copy number variation using DNA microarrays. Nature Genet. 39 S16-S21.
[11] Churchill, G. A. (1989). Stochastic models for heterogeneous genome sequences. Bull. Math. Biol. 51 79-94. · Zbl 0662.92012 · doi:10.1007/BF02458837
[12] Churchill, G. A. (1992). Hidden Markov chains and the analysis of genome structure. Comput. Chem. 16 107-115. · Zbl 0752.92015 · doi:10.1016/0097-8485(92)80037-Z
[13] Das, D., Banerjee, N. and Zhang, M. Q. (2004). Interacting models of cooperative gene regulation. Proc. Natl. Acad. Sci. USA 101 16234-16239.
[14] Dedecker, J., Doukhan, P., Lang, G., León R., J. R., Louhichi, S. and Prieur, C. (2007). Weak Dependence: With Examples and Applications. Lecture Notes in Statist. 190. Springer, New York. · Zbl 1165.62001 · doi:10.1007/978-0-387-69952-3
[15] Efron, B. (1981). Nonparametric standard errors and confidence intervals. With discussion and a reply by the author. Canad. J. Statist. 9 139-172. · Zbl 0482.62034 · doi:10.2307/3314608
[16] Fickett, J. W., Torney, D. C. and Wolf, D. R. (1992). Base compositional structure of genomes. Genomics 13 1056-1064.
[17] Fu, Y.-X. and Curnow, R. N. (1990). Maximum likelihood estimation of multiple change-points. Biometrika 77 563-573. · Zbl 0724.62025 · doi:10.1093/biomet/77.3.563
[18] Götze, F. and Račkauskas, A. (2001). Adaptive choice of bootstrap sample sizes. In State of the Art in Probability and Statistics (Leiden, 1999) 286-309. Lecture Notes Monogr. Ser. 36. Inst. Math. Statist., Beachwood, OH. · Zbl 1373.62177 · doi:10.1214/lnms/1215090074
[19] Gupta, M. and Liu, J. S. (2005). De novo cis-regulatory module elicitation for eukaryotic genomes. Proc. Natl. Acad. Sci. USA 102 7079-7084.
[20] Hall, P. (1992). The Bootstrap and Edgeworth Expansion. Springer, New York. · Zbl 0744.62026
[21] Huang, H., Kao, M. C., Zhou, X., Liu, J. S. and Wong, W. H. (2004). Determination of local statistical significance of patterns in Markov sequences with application to promoter element identification. J. Comput. Biol. 11 1-14.
[22] James, B., James, K. L. and Siegmund, D. (1987). Tests for a change-point. Biometrika 74 71-84. · Zbl 0632.62021 · doi:10.1093/biomet/74.1.71
[23] Kato, M., Hata, N., Banerjee, N., Futcher, B. and Zhang, M. Q. (2004). Identifying combinatorial regulation of transcription factors and binding motifs. Genome Biol. 5 R56.
[24] Künsch, H. (1989). The jackknife and the bootstrap for general stationary observations. Ann. Statist. 17 1217-1241. · Zbl 0684.62035 · doi:10.1214/aos/1176347265
[25] Letson, D. and McCullough, B. D. (1998). Better confidence intervals: The double bootstrap with no pivot. Amer. J. Agr. Econ. 80 552-559.
[26] Li, W., Stolovitzky, G., Bernaola-Galván, P. and Oliver, J. L. (1998). Compositional heterogeneity within, and uniformity between, DNA sequences of yeast chromosomes. Genome Res. 8 916-928.
[27] Li, W., Bernaola-Galván, P., Haghighi, F. and Grosse, I. (2002). Applications of recursive segmentation to the analysis of DNA sequences. Comput. Chem. 26 491-510.
[28] Margulies, E. H. et al. (2007). Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. 17 760-774.
[29] Olshen, A. B., Venkatraman, E. S., Lucito, R. and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5 557-572. · Zbl 1155.62478 · doi:10.1093/biostatistics/kxh008
[30] Politis, D. and Romano, J. (1994). Large sample confidence regions based on subsamples under minimal assumptions. Ann. Statist. 22 2031-2050. · Zbl 0828.62044 · doi:10.1214/aos/1176325770
[31] Politis, D., Romano, J. and Wolf, M. (1999). Subsampling. Springer, New York. · Zbl 0931.62035
[32] Redon, R. et al. (2006). Global variation in copy number in the human genome. Nature 444 444-454.
[33] Thisted, R. and Efron, B. (1987). Did Shakespeare write a newly-discovered poem? Biometrika 74 445-455. · Zbl 0635.62115 · doi:10.1093/biomet/74.3.445
[34] Venkatraman, S. (1992). Consistency results in multiple change-point problems. Ph.D. dissertation, Stanford Univ.
[35] Vostrikova, L. J. (1981). Detecting disorder in multidimensional random process. Sov. Math. Dokl. 24 55-59. · Zbl 0487.62072
[36] Yu, H., Yoo, A. S. and Greenwald, I. (2004). Cluster Analyzer for Transcription Sites (CATS): A C++-based program for identifying clustered transcription factor binding sites. Bioinformatics 20 1198-1200.
[37] Zhang, C., Xuan, Z., Mandel, G. and Zhang, M. Q. (2006). A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome. Nucleic Acids Res. 34 2238-2246.
[38] Zhou, Q. and Wong, W. H. (2004). CisModule: De Novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl. Acad. Sci. USA 101 12114-12119.