Computational biology: toward deciphering gene regulatory information in mammalian genomes. (English) Zbl 1113.62136

Summary: Computational biology is a rapidly evolving area where methodologies from computer science, mathematics and statistics are applied to address fundamental problems in biology. The study of gene regulatory information is a central problem in current computational biology. This article reviews recent developments of statistical methods related to this field. Starting from microarray gene selection, we examine methods for finding transcription factor binding motifs and cis-regulatory modules in coregulated genes, and methods for utilizing information from cross-species comparisons and ChIP-chip experiments. The ultimate understanding of cis-regulatory logic in mammalian genomes may require the integration of information collected from all these steps.

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology

Software:

BioProspector; TileMap; PipMaker; CisModule; PhyME