
Sparse Bayesian hierarchical modeling of high-dimensional clustering problems. (English) Zbl 1188.62137

Summary: Clustering is one of the most widely used procedures in the analysis of microarray data, for example with the goal of discovering cancer subtypes based on observed heterogeneity of genetic marks between different tissues. It is well known that in such high-dimensional settings, the many noise variables can overwhelm the few signals embedded in the high-dimensional space. We propose a novel Bayesian approach based on Dirichlet processes with a sparsity prior that simultaneously performs variable selection and clustering, and also discovers variables that distinguish only a subset of the cluster components. Unlike previous Bayesian formulations, we use the Dirichlet process (DP) both for clustering the samples and for regularizing the high-dimensional mean/variance structure. To meet the computational challenge posed by this double use of the DP, we embed a sequential sampling scheme within Markov chain Monte Carlo (MCMC) updates, which improves on the naive implementation of existing algorithms for DP mixture models. Our method is demonstrated in a simulation study and illustrated with the leukemia gene expression dataset.
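
As a rough illustration of how a Dirichlet process prior induces a clustering of samples without fixing the number of clusters in advance, the sketch below draws a partition from the Chinese restaurant process, the standard sequential representation of the DP prior on partitions. It is not the sampler proposed in the paper: it draws from the prior only, with no likelihood, no sparsity prior, and no MCMC. The function name `sample_crp_partition` and the parameters `alpha` and `seed` are assumptions introduced here for illustration.

```python
import random


def sample_crp_partition(n_items, alpha, seed=0):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    Hypothetical helper for illustration only. alpha > 0 is the DP
    concentration parameter; larger values tend to give more clusters.
    """
    rng = random.Random(seed)
    labels = []
    counts = []  # counts[k] = number of items currently assigned to cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1  # join an existing cluster
        labels.append(k)
    return labels


if __name__ == "__main__":
    labels = sample_crp_partition(n_items=20, alpha=1.0, seed=1)
    print("labels:", labels)
    print("number of clusters:", len(set(labels)))
```

Items are assigned one at a time, which is the same sequential flavour that the summary attributes to its sampling scheme; in a full DP mixture sampler the weights would additionally be multiplied by the likelihood of each item under each cluster.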

MSC:

62F15 Bayesian inference
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P10 Applications of statistics to biology and medical sciences; meta analysis
65C60 Computational problems in statistics (MSC2010)
65C05 Monte Carlo methods
92C50 Medical applications (general)

References:

[1] Golub, T. R.; Slonim, D. K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J. P.; Coller, H.; Loh, M. L.; Downing, J. R.; Caligiuri, M. A.; Bloomfield, C. D.; Lander, E. S., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286, 5439, 531-537 (1999)
[2] Banfield, J. D.; Raftery, A. E., Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 3, 803-821 (1993) · Zbl 0794.62034
[3] Fraley, C.; Raftery, A. E., Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association, 97, 458, 611-631 (2002) · Zbl 1073.62545
[4] Luan, Y.; Li, H., Clustering of time-course gene expression data using a mixed-effects model with B-splines, Bioinformatics, 19, 474-482 (2003)
[5] Pan, W.; Shen, X., Penalized model-based clustering with application to variable selection, Journal of Machine Learning Research, 8, 1145-1164 (2007) · Zbl 1222.68279
[6] Ma, P.; Zhong, W., Penalized clustering of large scale functional data with multiple covariates, Journal of the American Statistical Association, 103, 482, 625-636 (2008) · Zbl 1469.62288
[7] Tibshirani, R., Regression shrinkage and selection via the Lasso, Journal of the Royal Statistical Society Series B-Methodological, 58, 1, 267-288 (1996) · Zbl 0850.62538
[8] George, E. I.; McCulloch, R. E., Variable selection via Gibbs sampling, Journal of the American Statistical Association, 88, 423, 881-889 (1993)
[9] George, E. I.; McCulloch, R. E., Approaches for Bayesian variable selection, Statistica Sinica, 7, 2, 339-373 (1997) · Zbl 0884.62031
[10] Friedman, J. H.; Meulman, J. J., Clustering objects on subsets of attributes, Journal of the Royal Statistical Society Series B-Statistical Methodology, 66, 815-839 (2004) · Zbl 1060.62064
[11] Liu, J.; Zhang, J.; Palumbo, M.; Lawrence, C., Bayesian clustering with variable and transformation selection, Bayesian Statistics, 7, 249-275 (2003)
[12] Tadesse, M. G.; Sha, N.; Vannucci, M., Bayesian variable selection in clustering high-dimensional data, Journal of the American Statistical Association, 100, 470, 602-617 (2005) · Zbl 1117.62433
[13] Kim, S.; Tadesse, M. G.; Vannucci, M., Variable selection in clustering via Dirichlet process mixture models, Biometrika, 93, 4, 877-893 (2006) · Zbl 1436.62266
[14] Hoff, P., Model-based subspace clustering, Bayesian Analysis, 1, 321-344 (2006) · Zbl 1331.62309
[15] Lucas, J.; Carvalho, C.; Wang, Q.; Bild, A.; Nevins, J. R.; West, M., Sparse statistical modelling in gene expression genomics, (Bayesian Inference for Gene Expression and Proteomics (2006), Cambridge University Press), 155-176
[16] Seo, D. M.; Goldschmidt-Clermont, P. J.; West, M., Of mice and men: sparse statistical modeling in cardiovascular genomics, Annals of Applied Statistics, 1, 1, 152-178 (2007) · Zbl 1129.62104
[17] Carvalho, C.; Chang, J.; Lucas, J.; Nevins, J.; Wang, Q.; West, M., High-dimensional sparse factor modeling: applications in gene expression genomics, Journal of the American Statistical Association, 103, 484, 1438-1456 (2008) · Zbl 1286.62091
[18] Antoniak, C. E., Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, The Annals of Statistics, 2, 6, 1152-1174 (1974) · Zbl 0335.60034
[19] Ibrahim, J.; Chen, M. H.; Gray, R. J., Bayesian models for gene expression with DNA microarray data, Journal of the American Statistical Association, 97, 457, 88-99 (2002) · Zbl 1073.62578
[20] Chipman, H. A.; George, E. I.; McCulloch, R. E., BART: Bayesian additive regression trees, Annals of Applied Statistics (2010), in press · Zbl 1189.62066
[21] Rodriguez, A.; Dunson, D. B.; Gelfand, A. E., The nested Dirichlet process, Journal of the American Statistical Association, 103, 483, 1131-1144 (2008) · Zbl 1205.62062
[22] Nott, D. J., Predictive performance of Dirichlet process shrinkage methods in linear regression, Computational Statistics & Data Analysis, 52, 7, 3658-3669 (2008) · Zbl 1452.62506
[23] Dudoit, S.; Fridlyand, J.; Speed, T. P., Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, 97, 457, 77-87 (2002) · Zbl 1073.62576
[24] Thomas, J. G.; Olson, J. M.; Tapscott, S. J.; Zhao, L. P., An efficient and robust statistical modeling approach to discover differentially expressed genes using genomic expression profiles, Genome Research, 11, 7, 1227-1236 (2001)
[25] Fraley, C.; Raftery, A. E., MCLUST version 3 for R: normal mixture modeling and model-based clustering, Technical Report no. 504, Department of Statistics, University of Washington (2006)
[26] Bickel, P. J.; Levina, E., Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations, Bernoulli, 10, 6, 989-1010 (2004) · Zbl 1064.62073
[27] Tibshirani, R.; Hastie, T.; Narasimhan, B.; Chu, G., Class prediction by nearest shrunken centroids, with applications to DNA microarrays, Statistical Science, 18, 1, 104-117 (2003) · Zbl 1048.62109
[28] Neal, R. M., Markov chain sampling methods for Dirichlet process mixture models, Journal of Computational and Graphical Statistics, 9, 2, 249-265 (2000)
[29] Jain, S.; Neal, R. M., A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model, Journal of Computational and Graphical Statistics, 13, 1, 158-182 (2004)
[30] van Dyk, D. A.; Park, T., Partially collapsed Gibbs samplers: theory and methods, Journal of the American Statistical Association, 103, 482, 790-796 (2008) · Zbl 1471.62198
[31] Escobar, M. D.; West, M., Bayesian density estimation and inference using mixtures, Journal of the American Statistical Association, 90, 430, 577-588 (1995) · Zbl 0826.62021
[32] Stephens, M., Dealing with label switching in mixture models, Journal of the Royal Statistical Society Series B-Statistical Methodology, 62, 795-809 (2000) · Zbl 0957.62020