
zbMATH — the first resource for mathematics

Model-based clustering using copulas with applications. (English) Zbl 06652996
Summary: The majority of model-based clustering techniques is based on multivariate normal models and their variants. In this paper copulas are used for the construction of flexible families of models for clustering applications. The use of copulas in model-based clustering offers two direct advantages over current methods: (i) the appropriate choice of copulas provides the ability to obtain a range of exotic shapes for the clusters, and (ii) the explicit choice of marginal distributions for the clusters allows the modelling of multivariate data of various modes (either discrete or continuous) in a natural way. This paper introduces and studies the framework of copula-based finite mixture models for clustering applications. Estimation in the general case can be performed using standard EM, and, depending on the mode of the data, more efficient procedures are provided that can fully exploit the copula structure. The closure properties of the mixture models under marginalization are discussed, and for continuous, real-valued data parametric rotations in the sample space are introduced, with a parallel discussion on parameter identifiability depending on the choice of copulas for the components. The exposition of the methodology is accompanied and motivated by the analysis of real and artificial data.
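As a rough illustration of advantage (ii), the sketch below evaluates a two-component copula-based mixture density and the posterior membership probabilities that an EM E-step would compute. It is not the authors' implementation. It assumes two continuous dimensions, a Gaussian copula in each component, an exponential margin for the first variable and a normal margin for the second. All function names, parameter names, and numeric values are hypothetical choices made for this example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for the copula's quantile transform


def gaussian_copula_density(u, v, rho):
    """Density of the bivariate Gaussian copula at (u, v), for 0 < u, v < 1."""
    a, b = STD_NORMAL.inv_cdf(u), STD_NORMAL.inv_cdf(v)
    det = 1.0 - rho * rho
    expo = -(rho * rho * (a * a + b * b) - 2.0 * rho * a * b) / (2.0 * det)
    return math.exp(expo) / math.sqrt(det)


def component_density(x1, x2, comp):
    """Copula density at the marginal CDF values, times the marginal densities."""
    rate, margin2, rho = comp["rate"], comp["margin2"], comp["rho"]
    u = 1.0 - math.exp(-rate * x1)    # exponential CDF (assumed margin, x1 > 0)
    f1 = rate * math.exp(-rate * x1)  # exponential density
    v = margin2.cdf(x2)               # normal CDF (assumed margin)
    f2 = margin2.pdf(x2)              # normal density
    return gaussian_copula_density(u, v, rho) * f1 * f2


def responsibilities(x1, x2, weights, comps):
    """Posterior probability of each component for one observation (E-step)."""
    weighted = [w * component_density(x1, x2, c) for w, c in zip(weights, comps)]
    total = sum(weighted)
    return [w / total for w in weighted]


# Hypothetical mixture: weights, margins and dependence differ by component.
weights = [0.6, 0.4]
comps = [
    {"rate": 1.0, "margin2": NormalDist(0.0, 1.0), "rho": 0.7},
    {"rate": 0.2, "margin2": NormalDist(3.0, 0.5), "rho": -0.3},
]
post = responsibilities(0.8, 0.4, weights, comps)
```

Because the margins and the copula enter the component density as separate factors, either can be swapped (for example, a discrete margin or an Archimedean copula) without touching the other, although discrete margins would require rectangle probabilities rather than the density product used here.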

MSC:
62-XX Statistics
References:
[1] Alfò, M; Maruotti, A; Trovato, G, A finite mixture model for multivariate counts under endogenous selectivity, Stat. Comput., 21, 185-202, (2011)
[2] Andrews, JL; McNicholas, PD, Mixtures of modified t-factor analyzers for model-based clustering, classification, and discriminant analysis, J. Stat. Plan. Inference, 141, 1479-1486, (2011) · Zbl 1204.62098
[3] Banfield, JD; Raftery, AE, Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 803-821, (1993) · Zbl 0794.62034
[4] Bedford, T; Cooke, RM, Vines—a new graphical model for dependent random variables, Ann. Stat., 30, 1031-1068, (2002) · Zbl 1101.62339
[5] Brechmann, E.C., Schepsmeier, U.: Modeling dependence with C- and D-vine copulas: The R package CDVine. J. Stat. Softw. 52(3), 1-27 (2013)
[6] Browne, R; McNicholas, P, Model-based clustering, classification, and discriminant analysis of data with mixed type, J. Stat. Plan. Inference, 142, 2976-2984, (2012) · Zbl 1335.62093
[7] Celeux, G; Govaert, G, Gaussian parsimonious clustering models, Pattern Recogn., 28, 781-793, (1995)
[8] Dean, N; Nugent, R, Clustering student skill set profiles in a unit hypercube using mixtures of multivariate betas, Adv. Data Anal. Classif., 7, 339-357, (2013) · Zbl 1416.62334
[9] Di Lascio, FML; Giannerini, S, A copula-based algorithm for discovering patterns of dependent observations, J. Classif., 29, 50-75, (2012) · Zbl 1360.62250
[10] Fang, H-B; Fang, K-T; Kotz, S, The meta-elliptical distributions with given marginals, J. Multivar. Anal., 82, 1-16, (2002) · Zbl 1002.62016
[11] Forbes, F; Wraith, D, A new family of multivariate heavy-tailed distributions with variable marginal amounts of tailweight: application to robust clustering, Stat. Comput., 24, 971-984, (2014) · Zbl 1332.62204
[12] Fraley, C., Raftery, A.E., Murphy, T.B., Scrucca, L.: mclust version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report 597, Department of Statistics, University of Washington, Seattle (2012)
[13] Frühwirth-Schnatter, S; Pyne, S, Bayesian inference for finite mixtures of univariate and multivariate skew-normal and skew-t distributions, Biostatistics, 11, 317-336, (2010)
[14] Genest, C; Nešlehová, J, A primer on copulas for count data, ASTIN Bull., 37, 475-515, (2007) · Zbl 1274.62398
[15] Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., Hothorn, T.: mvtnorm: Multivariate normal and t distributions. R package version 0.9-9996. http://cran.r-project.org/package=mvtnorm (2013)
[16] Hanson, A.J.: Rotations for \(n\)-dimensional graphics. In Paeth, A. W. (Ed.), Graphics Gems V, Number II.4 in The Graphics Gems, Chapter II, pp. 55-64. Academic Press, San Diego (1995)
[17] Hennig, C, Methods for merging Gaussian mixture components, Adv. Data Anal. Classif., 4, 3-34, (2010) · Zbl 1306.62141
[18] Henningsen, A; Toomet, O, maxLik: A package for maximum likelihood estimation in R, Comput. Stat., 26, 443-458, (2011) · Zbl 1304.65039
[19] Hofert, M., Kojadinovic, I., Maechler, M., Yan, J.: copula: Multivariate Dependence with Copulas. R package version 0.999-13 (2015)
[20] Hofert, M; Mächler, M; McNeil, AJ, Likelihood inference for Archimedean copulas in high dimensions under known margins, J. Multivar. Anal., 110, 133-150, (2012) · Zbl 1244.62073
[21] Jajuga, K; Papla, D, Copula functions in model based clustering, No. 15, 606-613, (2006), Berlin
[22] Joe, H, Approximations to multivariate normal rectangle probabilities based on conditional expectations, J. Am. Stat. Assoc., 90, 957-964, (1995) · Zbl 0843.62016
[23] Joe, H.: Multivariate Models and Dependence Concepts. Chapman & Hall Ltd, London (1997) · Zbl 0990.62517
[24] Johnson, N., Kotz, S., Balakrishnan, N.: Multivariate Discrete Distributions. Wiley, New York (1997) · Zbl 0868.62048
[25] Jorgensen, M, Using multinomial mixture models to cluster Internet traffic, Aust. N. Z. J. Stat., 46, 205-218, (2004) · Zbl 1061.62198
[26] Karlis, D; Meligkotsidou, L, Finite multivariate Poisson mixtures with applications, J. Stat. Plan. Inference, 137, 1942-1960, (2007) · Zbl 1116.60006
[27] Karlis, D; Santourian, A, Model-based clustering with non-elliptically contoured distributions, Stat. Comput., 19, 73-83, (2009)
[28] Lee, S; McLachlan, G, Finite mixtures of multivariate skew t-distributions: some recent and new results, Stat. Comput., 24, 181-202, (2014) · Zbl 1325.62107
[29] Lin, T.-I., Ho, H., Lee, C.-R.: Flexible mixture modelling using the multivariate skew-t-normal distribution. Stat. Comput. 24(4), 531-546 (2014) · Zbl 1325.62113
[30] Marbac, M., Biernacki, C., Vandewalle, V.: Model-based clustering of Gaussian copulas for mixed data. ArXiv e-prints (2014). arXiv:1405.1299 · Zbl 1384.62198
[31] McLachlan, G., Peel, D.: Finite Mixture Models. Wiley, New York (2000) · Zbl 0963.62061
[32] McNicholas, PD; Murphy, TB, Parsimonious Gaussian mixture models, Stat. Comput., 18, 285-296, (2008)
[33] Meng, X-L; Rubin, DB, Maximum likelihood estimation via the ECM algorithm: a general framework, Biometrika, 80, 267-278, (1993) · Zbl 0778.62022
[34] Morris, K; McNicholas, P, Dimension reduction for model-based clustering via mixtures of shifted asymmetric Laplace distributions, Stat. Probab. Lett., 83, 2088-2093, (2013) · Zbl 1282.62153
[35] Nelsen, R.: An introduction to copulas, Springer series in statistics, 2nd ed. Springer, Berlin (2006)
[36] Panagiotelis, A; Czado, C; Joe, H, Pair copula constructions for multivariate discrete data, J. Am. Stat. Assoc., 107, 1063-1072, (2012) · Zbl 1395.62114
[37] R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2015)
[38] Robitzsch, A., Kiefer, T., George, A.C., Uenlue, A.: CDM: cognitive diagnosis modeling. R package version 2.6-13. http://cran.r-project.org/package=CDM (2014)
[39] Vrac, M; Billard, L; Diday, E; Chédin, A, Copula analysis of mixture models, Comput. Stat., 27, 427-457, (2012) · Zbl 1304.65087
[40] Zimmer, D; Trivedi, P, Using trivariate copulas to model sample selection and treatment effects: application to family health care demand, J. Bus. Econ. Stat., 24, 63-72, (2006)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.