
A Bayesian sparse finite mixture model for clustering data from a heterogeneous population. (English) Zbl 1445.62152

Summary: In this paper, we introduce a Bayesian approach for clustering data using a sparse finite mixture model (SFMM). The SFMM is a finite mixture model with a large, previously fixed number of components \(k\), many of which can be empty. In this model, \(k\) can be interpreted as the maximum number of distinct mixture components. We then explore the use of a prior distribution for the mixture weights that takes into account the possibility that the number of clusters \(k_{\mathbf{c}}\) (i.e., nonempty components) can be random and smaller than the number of components \(k\) of the finite mixture model. To determine the clusters, we develop an MCMC algorithm called the Split-Merge allocation sampler. In this algorithm, the split-merge strategy is data-driven and is included in order to improve the mixing of the Markov chain with respect to the number of clusters. The performance of the method is verified using simulated datasets and three real datasets. The first real dataset is the benchmark galaxy data, while the second and third are the publicly available Enzyme and Acidity datasets, respectively.
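The distinction between the fixed number of components \(k\) and the random number of nonempty clusters \(k_{\mathbf{c}}\) can be illustrated with a small prior simulation. The sketch below is not the authors' algorithm or their prior: it assumes a symmetric Dirichlet prior on the weights, a common choice in the sparse finite mixture literature, and the names `sample_dirichlet`, `count_clusters`, `prior_cluster_counts` and the parameter `alpha` are hypothetical. It only shows how a small concentration parameter tends to leave many of the \(k\) components empty.

```python
import random


def sample_dirichlet(k, alpha, rng):
    """Draw weights from a symmetric Dirichlet(alpha, ..., alpha) prior.

    Assumption: a symmetric Dirichlet stands in for the paper's weight prior.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:
        # Numerical guard for very small alpha: put all mass on one component.
        weights = [0.0] * k
        weights[rng.randrange(k)] = 1.0
        return weights
    return [d / total for d in draws]


def count_clusters(n, k, alpha, rng):
    """Allocate n observations to k components; return the number of nonempty ones."""
    weights = sample_dirichlet(k, alpha, rng)
    allocations = rng.choices(range(k), weights=weights, k=n)
    return len(set(allocations))


def prior_cluster_counts(n, k, alpha, draws=2000, seed=0):
    """Simulate the prior distribution of k_c (nonempty components) given k."""
    rng = random.Random(seed)
    return [count_clusters(n, k, alpha, rng) for _ in range(draws)]


if __name__ == "__main__":
    for alpha in (0.1, 1.0, 10.0):
        counts = prior_cluster_counts(n=100, k=10, alpha=alpha)
        mean_kc = sum(counts) / len(counts)
        print(f"alpha={alpha}: mean number of nonempty components = {mean_kc:.2f}")
```

Under these assumptions, smaller values of `alpha` concentrate the prior on partitions with fewer nonempty components, so \(k\) acts only as an upper bound on \(k_{\mathbf{c}}\).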

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F15 Bayesian inference
62P10 Applications of statistics to biology and medical sciences; meta analysis
62P35 Applications of statistics to physics
85A35 Statistical astronomy
85A05 Galactic and stellar dynamics

Software:

AS 136

References:

[1] Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19, 716-723. · Zbl 0314.62039 · doi:10.1109/TAC.1974.1100705
[2] Anderson, J. J. (1985). Normal mixtures and the number of clusters problem. Computational Statistics Quarterly 2, 3-14. · Zbl 0616.62087
[3] Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49, 803-821. · Zbl 0794.62034 · doi:10.2307/2532201
[4] Bensmail, H., Celeux, G., Raftery, A. E. and Robert, C. P. (1997). Inference in model-based cluster analysis. Statistics and Computing 7, 1-10.
[5] Binder, D. A. (1978). Bayesian cluster analysis. Biometrika 65, 31-38. · Zbl 0376.62007 · doi:10.1093/biomet/65.1.31
[6] Bouveyron, C. and Brunet, C. (2013). Model-based clustering of high-dimensional data: A review. Computational Statistics & Data Analysis 71, 52-78. · Zbl 1471.62032 · doi:10.1016/j.csda.2012.12.008
[7] Bozdogan, H. (1987). Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika 52, 345-370. · Zbl 0627.62005 · doi:10.1007/BF02294361
[8] Casella, G., Robert, C. and Wells, M. (2000). Mixture models, latent variables and partitioned importance sampling. Technical Report-2000-03, CREST, INSEE, Paris. · Zbl 1075.65016 · doi:10.1016/j.stamet.2004.05.001
[9] Celeux, G., Hurn, M. and Robert, C. P. (2000). Computational and inferential difficulties with mixture posterior distributions. Journal of the American Statistical Association 95, 957-970. · Zbl 0999.62020 · doi:10.1080/01621459.2000.10474285
[10] Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. American Statistician 49, 327-335.
[11] Escobar, M. D. and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association 90, 577-588. · Zbl 0826.62021 · doi:10.1080/01621459.1995.10476550
[12] Fraley, C. and Raftery, A. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association 97. · Zbl 1073.62545 · doi:10.1198/016214502760047131
[13] Frühwirth-Schnatter, S. (2017). From here to infinity: sparse finite versus Dirichlet process mixture in model-based clustering. https://arxiv.org/abs/1706.07194. · Zbl 1474.62225 · doi:10.1007/s11634-018-0329-y
[14] Hartigan, J. A. and Wong, M. A. (1978). Algorithm AS 136: A k-means clustering algorithm. Applied Statistics 28, 100-108. · Zbl 0447.62062
[15] Jasra, A., Holmes, C. C. and Stephens, D. A. (2005). Markov chain Monte Carlo methods and the label switching problem in Bayesian mixture modeling. Statistical Science 20, 50-67. · Zbl 1100.62032 · doi:10.1214/088342305000000016
[16] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, 281-297. Berkeley, CA: University of California Press. · Zbl 0214.46201
[17] McLachlan, G. and Basford, K. E. (1988). Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker. · Zbl 0697.62050
[18] McLachlan, G. and Peel, D. (2000). Finite Mixture Models. New York: Wiley Interscience. · Zbl 0963.62061
[19] Nobile, A. and Fearnside, A. T. (2007). Bayesian finite mixtures with an unknown number of components: The allocation sampler. Statistics and Computing 17, 147-162.
[20] Oh, M.-S. and Raftery, A. E. (2007). Model-based clustering with dissimilarities: A Bayesian approach. Journal of Computational and Graphical Statistics 16, 559-585.
[21] Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B, Statistical Methodology 59, 731-792. · Zbl 0891.62020 · doi:10.1111/1467-9868.00095
[22] Roeder, K. and Wasserman, L. (1997). Practical Bayesian density estimation using mixture of normals. Journal of the American Statistical Association 92, 894-902. · Zbl 0889.62021 · doi:10.1080/01621459.1997.10474044
[23] Saraiva, E. F., Louzada, F. and Milan, L. A. (2014). Mixture models with an unknown number of components via a new posterior split-merge MCMC algorithm. Applied Mathematics and Computation 244, 959-975. · Zbl 1335.62061 · doi:10.1016/j.amc.2014.07.032
[24] Saraiva, E. F., Suzuki, A. K., Louzada, F. and Milan, L. A. (2016). Partitioning gene expression data by data-driven Markov chain Monte Carlo. Journal of Applied Statistics 43, 1155-1173. · Zbl 1514.62846
[25] Saraiva, E. F., Suzuki, A. K. and Milan, L. A. (2019). Supplement to “A Bayesian sparse finite mixture model for clustering data from a heterogeneous population.” https://doi.org/10.1214/18-BJPS425SUPP.
[26] Schwarz, G. E. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461-464. · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[27] Sneath, P. H. A. (1957). The application of computers to taxonomy. Journal of General Microbiology 17, 201-206.
[28] Sokal, R. R. and Michener, C. D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin 38, 1409-1438.
[29] Spiegelhalter, D. J., Best, N. G., Carlin, B. P. and Van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B 64, 583-616. · Zbl 1067.62010 · doi:10.1111/1467-9868.00353
[30] Stephens, M. (2000). Dealing with label switching in mixture models. Journal of the Royal Statistical Society, Series B, Statistical Methodology 62, 795-809. · Zbl 0957.62020 · doi:10.1111/1467-9868.00265
[31] Malsiner-Walli, G., Frühwirth-Schnatter, S. and Grün, B. (2016). Model-based clustering based on sparse finite Gaussian mixtures. Statistics and Computing 26, 303-324. · Zbl 1342.62109 · doi:10.1007/s11222-014-9500-2
[32] Ward, J. H. (1963). Hierarchical groupings to optimize an objective function. Journal of the American Statistical Association 58, 234-244.
[33] Witten, D. · Zbl 1392.62194 · doi:10.1198/jasa.2010.tm09415
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.