Heterogeneity in DNA multiple alignments: modeling, inference, and applications in motif finding. (English) Zbl 1203.62184

Summary: Transcription factors bind sequence-specific sites in DNA to regulate gene transcription, and identifying transcription factor binding sites (TFBSs) is an important step toward understanding gene regulation. Computational methods for TFBS detection and motif finding are often sophisticated in how they model TFBSs and their combinatorial patterns, yet they commonly assume an oversimplified, homogeneous model for the background sequences. Because nucleotide base composition varies across genomic regions, incorporating this heterogeneity into the background model is expected to help motif finding. When sequences from multiple species are used, variation in evolutionary conservation also violates the common assumption that a multiple alignment has a single conservation level. To handle both types of heterogeneity, we propose a generative model in which a segmented Markov chain partitions a multiple alignment into regions of homogeneous nucleotide base composition and a hidden Markov model (HMM) accounts for different conservation levels. Bayesian inference on the model is developed via Gibbs sampling with dynamic programming recursions. Simulation studies and empirical evidence from biological data sets show that background modeling has a dramatic effect on motif finding and that the proposed approach achieves substantial improvements over commonly used background models.
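
The phrase "Gibbs sampling with dynamic programming recursions" can be illustrated with a generic forward-filtering, backward-sampling step for a discrete HMM. The sketch below is not the paper's model: it assumes a hypothetical two-state HMM (state 1 = conserved, state 0 = not conserved) whose observation for each alignment column is simply whether all species share the same base. The function names, the binary column encoding, and all parameter values are illustrative assumptions.

```python
import random


def columns_to_obs(alignment):
    """Encode each alignment column as 1 if all rows agree, else 0.

    This binary encoding is an illustrative assumption, not the
    emission model used in the paper.
    """
    return [1 if len(set(col)) == 1 else 0 for col in zip(*alignment)]


def sample_state_path(obs, trans, emit, init, rng):
    """Draw one hidden-state path from its posterior given obs.

    obs   : list of observed symbol indices, one per column
    trans : trans[i][j] = P(state j at t+1 | state i at t)
    emit  : emit[i][o]  = P(symbol o | state i)
    init  : init[i]     = P(state i at t = 0)
    rng   : a random.Random instance
    """
    n, k = len(obs), len(init)

    # Forward recursion, normalized at each step to avoid underflow.
    prev = [init[i] * emit[i][obs[0]] for i in range(k)]
    total = sum(prev)
    prev = [p / total for p in prev]
    alpha = [prev]
    for t in range(1, n):
        cur = [
            emit[j][obs[t]] * sum(prev[i] * trans[i][j] for i in range(k))
            for j in range(k)
        ]
        total = sum(cur)
        cur = [c / total for c in cur]
        alpha.append(cur)
        prev = cur

    # Backward sampling: draw the last state, then each earlier state
    # conditional on the state already drawn at t + 1.
    path = [0] * n
    path[-1] = rng.choices(range(k), weights=alpha[-1])[0]
    for t in range(n - 2, -1, -1):
        weights = [alpha[t][i] * trans[i][path[t + 1]] for i in range(k)]
        path[t] = rng.choices(range(k), weights=weights)[0]
    return path


if __name__ == "__main__":
    # Toy three-species alignment (made-up data for illustration).
    alignment = ["ACGTAC", "ACGAAC", "ACGTTC"]
    obs = columns_to_obs(alignment)
    trans = [[0.9, 0.1], [0.1, 0.9]]   # sticky states (assumed values)
    emit = [[0.7, 0.3], [0.2, 0.8]]    # conserved state favors symbol 1
    init = [0.5, 0.5]
    print(sample_state_path(obs, trans, emit, init, random.Random(0)))
```

Within a Gibbs sampler, a step like this would be alternated with updates of the transition and emission parameters given the sampled path. Sampling the whole path jointly, rather than one position at a time, is what the dynamic programming recursion buys.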

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology
62F15 Bayesian inference
92D10 Genetics and epigenetics
65C60 Computational problems in statistics (MSC2010)
60J20 Applications of Markov chains and discrete-time Markov processes on general state spaces (social mobility, learning theory, industrial processes, etc.)
90C90 Applications of mathematical programming
