zbMATH — the first resource for mathematics

Model-based clustering of Gaussian copulas for mixed data. (English) Zbl 1384.62198
Summary: Clustering mixed data is important yet challenging because few conventional distributions exist for such data. In this article, we propose a mixture model of Gaussian copulas for clustering mixed data. Copulas, and Gaussian copulas in particular, are powerful tools for modeling the distribution of multivariate variables. The model clusters data sets with continuous, integer, and ordinal variables (all having a cumulative distribution function) by accounting for intra-component dependencies in a similar way to the Gaussian mixture. Each component of the Gaussian copula mixture produces a correlation coefficient for each pair of variables, and its univariate margins follow standard distributions (Gaussian, Poisson, and ordered multinomial) depending on the nature of the variable (continuous, integer, or ordinal). As a by-product, the model generalizes many well-known approaches and provides visualization tools based on its parameters. Bayesian inference is performed with a Metropolis-within-Gibbs sampler. Numerical experiments on simulated and real data illustrate the benefits of the proposed model: a flexible and meaningful parameterization combined with visualization features.
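The generative structure described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not cover the Metropolis-within-Gibbs inference; it only draws samples from a mixture in which each component couples a Gaussian, a Poisson, and an ordered multinomial margin through a Gaussian copula. All function names, parameter names, and numerical values below are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula transform


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def poisson_quantile(u, rate):
    """Smallest integer k with P(X <= k) >= u for X ~ Poisson(rate)."""
    k, pmf = 0, math.exp(-rate)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= rate / k
        cdf += pmf
    return k


def ordinal_quantile(u, probs):
    """Smallest level index whose cumulative probability reaches u."""
    cdf = 0.0
    for level, p in enumerate(probs):
        cdf += p
        if u <= cdf:
            return level
    return len(probs) - 1


def sample_component(corr_chol, mean, sd, rate, probs, rng):
    """Draw (continuous, integer, ordinal) from one Gaussian-copula component."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(3)]
    # Correlated standard normals: z = L @ eps, with L the Cholesky factor.
    z = [sum(corr_chol[i][k] * eps[k] for k in range(i + 1)) for i in range(3)]
    # Uniform scores, clamped away from 0 and 1 for numerical safety.
    u = [min(max(STD_NORMAL.cdf(v), 1e-12), 1 - 1e-12) for v in z]
    x_cont = mean + sd * z[0]             # Gaussian margin: quantile is affine in z
    x_int = poisson_quantile(u[1], rate)  # Poisson margin
    x_ord = ordinal_quantile(u[2], probs) # ordered multinomial margin
    return x_cont, x_int, x_ord


def sample_mixture(weights, components, n, seed=0):
    """Draw n observations and their component labels from the mixture."""
    rng = random.Random(seed)
    data, labels = [], []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]
        labels.append(k)
        data.append(sample_component(rng=rng, **components[k]))
    return data, labels


# Two hypothetical components, each with its own correlation matrix and margins.
components = [
    dict(corr_chol=cholesky([[1.0, 0.6, 0.3], [0.6, 1.0, 0.2], [0.3, 0.2, 1.0]]),
         mean=0.0, sd=1.0, rate=2.0, probs=[0.5, 0.3, 0.2]),
    dict(corr_chol=cholesky([[1.0, -0.4, 0.0], [-0.4, 1.0, 0.5], [0.0, 0.5, 1.0]]),
         mean=4.0, sd=0.5, rate=8.0, probs=[0.1, 0.3, 0.6]),
]

if __name__ == "__main__":
    data, labels = sample_mixture([0.4, 0.6], components, n=5)
    for row, label in zip(data, labels):
        print(label, row)
```

Each component carries one correlation coefficient per pair of variables, which matches the parameterization the summary describes. The margins only enter through their quantile functions, so other distributions with a cumulative distribution function could be substituted in the same way.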

MSC:
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H05 Characterization and structure theory for multivariate probability distributions; copulas
62F15 Bayesian inference
Software:
bfa; MULTIMIX