×

Convergence rate of Markov chain methods for genomic motif discovery. (English) Zbl 1347.62048

Summary: We analyze the convergence rate of a simplified version of a popular Gibbs sampling method used for statistical discovery of gene regulatory binding motifs in DNA sequences. This sampler satisfies a very strong form of ergodicity (uniform). However, we show that, due to multimodality of the posterior distribution, the rate of convergence often decreases exponentially as a function of the length of the DNA sequence. Specifically, we show that this occurs whenever there is more than one true repeating pattern in the data. In practice there are typically multiple such patterns in biological data, the goal being to detect the most well-conserved and frequently-occurring of these. Our findings match empirical results, in which the motif-discovery Gibbs sampler has exhibited such poor convergence that it is used only for finding modes of the posterior distribution (candidate motifs) rather than for obtaining samples from that distribution. Ours are some of the first meaningful bounds on the convergence rate of a Markov chain method for sampling from a multimodal posterior distribution, as a function of statistical quantities like the number of observations.

MSC:

62F15 Bayesian inference
60J10 Markov chains (discrete-time Markov processes on discrete state spaces)
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

BioProspector; SSS
PDFBibTeX XMLCite
Full Text: DOI arXiv Euclid

References:

[1] Andrieu, C., Doucet, A. and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 72 269-342. · doi:10.1111/j.1467-9868.2009.00736.x
[2] Belloni, A. and Chernozhukov, V. (2009). On the computational complexity of MCMC-based estimators in large samples. Ann. Statist. 37 2011-2055. · Zbl 1175.65015 · doi:10.1214/08-AOS634
[3] Berk, R. H. (1966). Limiting behavior of posterior distributions when the model is incorrect. Ann. Math. Statist. 37 51-58. · Zbl 0151.23802 · doi:10.1214/aoms/1177699597
[4] Bhatnagar, N. and Randall, D. (2004). Torpid mixing of simulated tempering on the Potts model. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms 478-487. ACM, New York. · Zbl 1318.82023
[5] Borgs, C., Chayes, J. T., Frieze, A., Kim, J. H., Tetali, P., Vigoda, E. and Vu, V. H. (1999). Torpid mixing of some MCMC algorithms in statistical physics. In Proceedings of the 40 th IEEE Symposium on Foundations of Computer Science 218-229. IEEE, New York.
[6] Del Moral, P., Doucet, A. and Jasra, A. (2006). Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 411-436. · Zbl 1105.62034 · doi:10.1111/j.1467-9868.2006.00553.x
[7] Diaconis, P. and Saloff-Coste, L. (1993). Comparison theorems for reversible Markov chains. Ann. Appl. Probab. 3 696-730. · Zbl 0799.60058 · doi:10.1214/aoap/1177005359
[8] Diaconis, P. and Saloff-Coste, L. (1996). Logarithmic Sobolev inequalities for finite Markov chains. Ann. Appl. Probab. 6 695-750. · Zbl 0867.60043 · doi:10.1214/aoap/1034968224
[9] Diaconis, P. and Stroock, D. (1991). Geometric bounds for eigenvalues of Markov chains. Ann. Appl. Probab. 1 36-61. · Zbl 0731.60061 · doi:10.1214/aoap/1177005980
[10] Fort, G., Moulines, E., Roberts, G. O. and Rosenthal, J. S. (2003). On the geometric ergodicity of hybrid samplers. J. Appl. Probab. 40 123-146. · Zbl 1028.65002 · doi:10.1239/jap/1044476831
[11] Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statist. Sci. 7 457-472. · Zbl 1386.65060
[12] Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern. Anal. Mach. Intell. 6 721-741. · Zbl 0573.62030 · doi:10.1109/TPAMI.1984.4767596
[13] Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bayesian Statistics , 4 ( PeñíScola , 1991) (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 169-193. Oxford Univ. Press, New York.
[14] Green, P. J. and Richardson, S. (2002). Hidden Markov models and disease mapping. J. Amer. Statist. Assoc. 97 1055-1070. · Zbl 1046.62117 · doi:10.1198/016214502388618870
[15] Hans, C., Dobra, A. and West, M. (2007). Shotgun stochastic search for “large \(p\)” regression. J. Amer. Statist. Assoc. 102 507-516. · Zbl 1134.62398 · doi:10.1198/016214507000000121
[16] Jarner, S. F. and Hansen, E. (2000). Geometric ergodicity of Metropolis algorithms. Stochastic Process. Appl. 85 341-361. · Zbl 0997.60070 · doi:10.1016/S0304-4149(99)00082-4
[17] Jensen, S. T., Liu, X. S., Zhou, Q. and Liu, J. S. (2004). Computational discovery of gene regulatory binding motifs: A Bayesian perspective. Statist. Sci. 19 188-204. · Zbl 1057.62101 · doi:10.1214/088342304000000107
[18] Johnson, A. A. and Jones, G. L. (2010). Gibbs sampling for a Bayesian hierarchical general linear model. Electron. J. Stat. 4 313-333. · Zbl 1329.62336 · doi:10.1214/09-EJS515
[19] Jones, G. L. and Hobert, J. P. (2001). Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statist. Sci. 16 312-334. · Zbl 1127.60309 · doi:10.1214/ss/1015346317
[20] Jones, G. L. and Hobert, J. P. (2004). Sufficient burn-in for Gibbs samplers for a hierarchical random effects model. Ann. Statist. 32 784-817. · Zbl 1048.62069 · doi:10.1214/009053604000000184
[21] Kamatani, K. (2011). Weak consistency of Markov chain Monte Carlo methods. Technical report. Available at . · Zbl 1294.65007
[22] Kellis, M., Patterson, N., Birren, B., Berger, B. and Lander, E. S. (2004). Methods in comparative genomics: Genome correspondence, gene identification and regulatory motif discovery. J. Comput. Biol. 11 319-355.
[23] Kullback, S. (1959). Information Theory and Statistics . Wiley, New York. · Zbl 0088.10406
[24] Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. and Wootton, J. C. (1993). Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262 208-214.
[25] Liang, F. and Wong, W. H. (2000). Evolutionary Monte Carlo: Applications to \(C_p\) model sampling and change point problem. Statist. Sinica 10 317-342. · Zbl 1054.65500
[26] Liu, J. S. (1994). The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. J. Amer. Statist. Assoc. 89 958-966. · Zbl 0804.62033 · doi:10.2307/2290921
[27] Liu, X., Brutlag, D. L. and Liu, J. S. (2001). BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pacific Symposium on Biocomputing 6 127-138.
[28] Liu, J. S., Neuwald, A. F. and Lawrence, C. E. (1995). Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Amer. Statist. Assoc. 90 1156-1170. · Zbl 0864.62076 · doi:10.2307/2291508
[29] Liu, J. S., Wong, W. H. and Kong, A. (1995). Covariance structure and convergence rate of the Gibbs sampler with various scans. J. Roy. Statist. Soc. Ser. B 57 157-169. · Zbl 0811.60056
[30] Madras, N. and Randall, D. (2002). Markov chain decomposition for convergence rate analysis. Ann. Appl. Probab. 12 581-606. · Zbl 1017.60080 · doi:10.1214/aoap/1026915617
[31] Madras, N. and Zheng, Z. (2003). On the swapping algorithm. Random Structures Algorithms 22 66-97. · Zbl 1013.60074 · doi:10.1002/rsa.10066
[32] Mira, A. (2001). Ordering and improving the performance of Monte Carlo Markov chains. Statist. Sci. 16 340-350. · Zbl 1127.60312 · doi:10.1214/ss/1015346319
[33] Mossel, E. and Vigoda, E. (2006). Limitations of Markov chain Monte Carlo algorithms for Bayesian inference of phylogeny. Ann. Appl. Probab. 16 2215-2234. · Zbl 1121.60078 · doi:10.1214/105051600000000538
[34] Neuwald, A. F., Liu, J. S. and Lawrence, C. E. (1995). Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. Protein Sci. 4 1618-1632.
[35] Peskun, P. H. (1973). Optimum Monte-Carlo sampling using Markov chains. Biometrika 60 607-612. · Zbl 0271.62041 · doi:10.1093/biomet/60.3.607
[36] Roberts, G. O. and Rosenthal, J. S. (2004). General state space Markov chains and MCMC algorithms. Probab. Surv. 1 20-71. · Zbl 1189.60131 · doi:10.1214/154957804100000024
[37] Roberts, G. O. and Sahu, S. K. (2001). Approximate predetermined convergence properties of the Gibbs sampler. J. Comput. Graph. Statist. 10 216-229. · Zbl 04567020 · doi:10.1198/10618600152627915
[38] Rosenthal, J. S. (1995). Minorization conditions and convergence rates for Markov chain Monte Carlo. J. Amer. Statist. Assoc. 90 558-566. · Zbl 0824.60077 · doi:10.2307/2291067
[39] Rosenthal, J. S. (1996). Analysis of the Gibbs sampler for a model related to James-Stein estimators. Statist. Comput. 6 269-275.
[40] Roth, F. P., Hughes, J. D., Estep, P. W. and Church, G. M. (1998). Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotechnol. 16 939-945.
[41] Sinclair, A. (1992). Improved bounds for mixing rates of Markov chains and multicommodity flow. Combin. Probab. Comput. 1 351-370. · Zbl 0801.90039 · doi:10.1017/S0963548300000390
[42] Tierney, L. (1998). A note on Metropolis-Hastings kernels for general state spaces. Ann. Appl. Probab. 8 1-9. · Zbl 0935.60053 · doi:10.1214/aoap/1027961031
[43] Woodard, D. B. and Rosenthal, J. S. (2013). Supplement to “Convergence rate of Markov chain methods for genomic motif discovery.” . · Zbl 1347.62048
[44] Woodard, D. B., Schmidler, S. C. and Huber, M. (2009a). Conditions for rapid mixing of parallel and simulated tempering on multimodal distributions. Ann. Appl. Probab. 19 617-640. · Zbl 1171.65008 · doi:10.1214/08-AAP555
[45] Woodard, D. B., Schmidler, S. C. and Huber, M. (2009b). Sufficient conditions for torpid mixing of parallel and simulated tempering. Electron. J. Probab. 14 780-804. · Zbl 1189.65021 · doi:10.1214/EJP.v14-638
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.