Bayesian clustering of replicated time-course gene expression data with weak signals. (English) Zbl 1283.62050

Summary: To identify novel dynamic patterns of gene expressions, we develop a statistical method to cluster noisy measurements of gene expressions collected from multiple replicates at multiple time points, with an unknown number of clusters. We propose a random-effects mixture model coupled with a Dirichlet-process prior for clustering. The mixture model formulation allows for probabilistic cluster assignments. The random-effects formulation allows for attributing the total variability in the data to the sources that are consistent with the experimental design, particularly when the noise level is high and the temporal dependence is not strong. The Dirichlet-process prior induces a prior distribution on the partitions and helps to estimate the number of clusters (or mixture components) from the data.
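As an illustration of how a Dirichlet-process prior induces a distribution on partitions, the sketch below draws a partition from the Chinese restaurant process, a standard sequential description of that induced prior. It is not the sampler from the paper or from the DIRECT package. The function name `sample_crp_partition` and the concentration parameter `alpha` are assumptions made for this example, and the 163 items only echo the number of genes mentioned in the summary.

```python
import random


def sample_crp_partition(n_items, alpha, rng):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    alpha (> 0) is an assumed concentration parameter: larger values
    favour more clusters a priori.
    """
    labels = []
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1
        labels.append(k)
    return labels


rng = random.Random(0)
labels = sample_crp_partition(163, alpha=1.0, rng=rng)
n_clusters = len(set(labels))  # random: the prior does not fix the number of clusters
```

The number of clusters is not an input here; it is a by-product of the sampled partition, which is what lets the posterior inform it from the data.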
We further tackle two challenges associated with methods based on Dirichlet-process priors. One is efficient sampling: we develop a novel Metropolis-Hastings Markov chain Monte Carlo (MCMC) procedure to sample the partitions. The other is efficient use of the MCMC samples in forming clusters: we propose a two-step procedure for posterior inference, which involves resampling and relabeling, to estimate the posterior allocation probability matrix. This matrix can be used directly for cluster assignment while also describing the uncertainty in clustering. We demonstrate the effectiveness of our model and sampling procedure on simulated data. Applying our method to a real data set collected from Drosophila adult muscle cells after five-minute Notch activation, we identify 14 clusters of different transcriptional responses among 163 differentially expressed genes, which provides novel insights into the underlying transcriptional mechanisms of the Notch signaling pathway. The algorithm developed here is implemented in the R package DIRECT, available on CRAN.
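To make the idea of a posterior allocation probability matrix concrete, the following is a minimal sketch under simplifying assumptions, not the two-step procedure of the paper. It assumes every MCMC draw uses the same fixed number of labels, skips the resampling step entirely, and relabels each draw by brute-force search over label permutations against the first draw. The helper names `relabel_to_reference` and `allocation_probabilities` and the toy draws are hypothetical.

```python
from itertools import permutations


def relabel_to_reference(sample, reference, n_labels):
    """Permute the labels in `sample` to maximize agreement with `reference`."""
    best_perm, best_agree = None, -1
    for perm in permutations(range(n_labels)):
        agree = sum(perm[s] == r for s, r in zip(sample, reference))
        if agree > best_agree:
            best_perm, best_agree = perm, agree
    return [best_perm[s] for s in sample]


def allocation_probabilities(samples, n_labels):
    """Estimate P(item i belongs to cluster k) as a frequency over relabeled draws."""
    reference = samples[0]  # assumption: the first draw defines the labeling
    counts = [[0] * n_labels for _ in range(len(reference))]
    for sample in samples:
        relabeled = relabel_to_reference(sample, reference, n_labels)
        for i, k in enumerate(relabeled):
            counts[i][k] += 1
    return [[c / len(samples) for c in row] for row in counts]


# Toy draws for 4 items and 2 labels; the second draw equals the first
# with its labels swapped, which relabeling should undo.
samples = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 1],
]
probs = allocation_probabilities(samples, n_labels=2)
# Assign each item to its most probable cluster.
assignments = [max(range(2), key=lambda k: row[k]) for row in probs]
```

Each row of `probs` sums to one, so the matrix yields both a hard assignment (the row maximum) and a measure of how uncertain that assignment is. Brute-force permutation search is only feasible for a handful of labels; an assignment-problem solver would be the usual replacement for larger problems.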

MSC:

62F15 Bayesian inference
62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62-04 Software, source code, etc. for problems pertaining to statistics
65C60 Computational problems in statistics (MSC2010)

Software:

mclust; DIRECT; R; CRAN
