
Model-based clustering based on sparse finite Gaussian mixtures. (English) Zbl 1342.62109

Summary: In the framework of Bayesian model-based clustering with a finite mixture of Gaussian distributions, we present a joint approach that simultaneously estimates the number of mixture components, identifies cluster-relevant variables, and yields an identified model. The approach specifies sparse hierarchical priors on the mixture weights and the component means. In a deliberately overfitting mixture model, the sparse prior on the weights empties superfluous components during MCMC sampling. A straightforward estimator of the true number of components is then the number of non-empty components visited most frequently during MCMC sampling. Specifying a shrinkage prior, namely the normal gamma prior, on the component means improves the parameter estimates and identifies the cluster-relevant variables. After the mixture model has been estimated using MCMC methods based on data augmentation and Gibbs sampling, an identified model is obtained by relabeling the MCMC output in the point process representation of the draws. The relabeling is performed using \(K\)-centroids cluster analysis based on the Mahalanobis distance. We evaluate the proposed strategy in a simulation study with artificial data and by applying it to benchmark data sets.
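
The estimator of the number of components described in the summary can be illustrated with a minimal Python sketch. It assumes the MCMC output is available as an integer array of component allocations with one row per draw and one column per observation. The function name `estimate_num_components`, the array layout, and the rule of breaking ties toward the smaller count are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

import numpy as np


def estimate_num_components(allocations):
    """Return the most frequent number of non-empty components.

    allocations: array of shape (n_draws, n_obs); entry [m, i] is the
    component label assigned to observation i in MCMC draw m.
    (Hypothetical layout, assumed for illustration.)
    """
    allocations = np.asarray(allocations)
    # A component is non-empty in a draw if at least one observation is
    # allocated to it, so count the distinct labels in each row.
    nonempty_counts = [len(np.unique(draw)) for draw in allocations]
    freq = Counter(nonempty_counts)
    # Pick the most frequent count; ties go to the smaller count
    # (an assumption made here, not a rule stated in the summary).
    k_hat = min(freq, key=lambda k: (-freq[k], k))
    return k_hat, dict(freq)


# Toy allocations: 5 draws, 5 observations, labels from an overfitting model.
draws = np.array([
    [0, 0, 1, 1, 2],  # 3 non-empty components
    [0, 0, 1, 1, 1],  # 2
    [3, 3, 1, 1, 1],  # 2
    [0, 2, 1, 1, 1],  # 3
    [4, 4, 4, 1, 1],  # 2
])
k_hat, freq = estimate_num_components(draws)
print(k_hat, freq)
```

Because the count depends only on which labels are occupied in each draw, it is unaffected by label switching, so it can be computed before any relabeling step.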

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F15 Bayesian inference
