# zbMATH — the first resource for mathematics

Clustering bivariate mixed-type data via the cluster-weighted model. (English) Zbl 1347.65030
Summary: The cluster-weighted model (CWM) is a mixture model with random covariates that allows for flexible clustering/classification and distribution estimation of a random vector composed of a response variable and a set of covariates. Within this class of models, the generalized linear exponential CWM is introduced here specifically for modeling bivariate mixed-type data. Its natural counterpart in the family of latent class models is also defined. Maximum likelihood parameter estimates are derived using the expectation-maximization algorithm, and some computational issues are detailed. Monte Carlo experiments compare the classification performance of the proposed model with that of other mixture-based approaches, evaluate the consistency of the estimators of the regression coefficients, and compare several likelihood-based information criteria for selecting the number of mixture components. Finally, an application to real data is considered.
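
As an illustration of the kind of model the summary describes, the following is a minimal sketch of an EM fit for a two-variable CWM, in which each component contributes a term $\pi_g\, p(y \mid x; \beta_g)\, p(x; \mu_g, \sigma_g)$ to the joint density. It is not the authors' implementation and does not use the flexCWM API. Everything specific here is an assumption chosen for the example: a Gaussian covariate, a Poisson response with log link, a quantile-based starting partition, a few Newton steps in place of a fully converged M-step, the BIC written in its smaller-is-better form, and all function names and simulation parameters.

```python
import math
import numpy as np


def log_components(x, y, pi, mu, sigma, beta):
    """(n, G) array of log[pi_g * Poisson(y | exp(b0_g + b1_g*x)) * N(x | mu_g, sigma_g^2)]."""
    # Clipping the linear predictor is only a numerical guard against overflow.
    eta = np.clip(beta[:, 0] + beta[:, 1] * x[:, None], -30.0, 30.0)
    log_fact = np.array([math.lgamma(v + 1.0) for v in y])[:, None]
    log_pois = y[:, None] * eta - np.exp(eta) - log_fact
    log_norm = -0.5 * np.log(2.0 * np.pi * sigma**2) - 0.5 * ((x[:, None] - mu) / sigma) ** 2
    return np.log(pi) + log_pois + log_norm


def e_step(x, y, pi, mu, sigma, beta):
    """Posterior membership probabilities and the observed-data log-likelihood."""
    lc = log_components(x, y, pi, mu, sigma, beta)
    m = lc.max(axis=1, keepdims=True)  # log-sum-exp trick for stability
    log_mix = m + np.log(np.exp(lc - m).sum(axis=1, keepdims=True))
    return np.exp(lc - log_mix), float(log_mix.sum())


def weighted_poisson_fit(x, y, w, beta, n_newton=5):
    """A few Newton steps for a weighted Poisson regression with log link."""
    X = np.column_stack([np.ones_like(x), x])
    for _ in range(n_newton):
        lam = np.exp(np.clip(X @ beta, -30.0, 30.0))
        grad = X.T @ (w * (y - lam))
        hess = X.T @ (X * (w * lam)[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def fit_cwm(x, y, G=2, n_iter=100, tol=1e-8):
    """EM for the illustrative Gaussian-covariate / Poisson-response CWM."""
    n = x.size
    X = np.column_stack([np.ones_like(x), x])
    # Assumed starting partition: hard-assign points to G groups by quantiles of x.
    edges = np.quantile(x, np.linspace(0.0, 1.0, G + 1)[1:-1])
    z = np.eye(G)[np.searchsorted(edges, x)]
    # Assumed starting coefficients: weighted least squares of log(y + 0.5) on x.
    beta = np.zeros((G, 2))
    for g in range(G):
        sw = np.sqrt(z[:, g])
        beta[g] = np.linalg.lstsq(X * sw[:, None], sw * np.log(y + 0.5), rcond=None)[0]
    history = []
    for _ in range(n_iter):
        # M-step: closed form for pi, mu, sigma; Newton steps for beta.
        nk = z.sum(axis=0)
        pi = nk / n
        mu = (z * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((z * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        beta = np.array([weighted_poisson_fit(x, y, z[:, g], beta[g]) for g in range(G)])
        # E-step: update posterior probabilities and record the log-likelihood.
        z, ll = e_step(x, y, pi, mu, sigma, beta)
        history.append(ll)
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    n_par = (G - 1) + 2 * G + 2 * G  # weights, Gaussian parameters, regression coefficients
    bic = -2.0 * history[-1] + n_par * math.log(n)  # smaller is better in this form
    return {"pi": pi, "mu": mu, "sigma": sigma, "beta": beta, "z": z,
            "loglik": history, "bic": bic}


def simulate(n_per_group=300, seed=0):
    """Draw mixed-type pairs (continuous x, count y) from two made-up components."""
    rng = np.random.default_rng(seed)
    x = np.concatenate([rng.normal(-2.0, 0.5, n_per_group),
                        rng.normal(2.0, 0.5, n_per_group)])
    labels = np.repeat([0, 1], n_per_group)
    true_beta = np.array([[0.5, 0.3], [1.0, 0.5]])
    lam = np.exp(true_beta[labels, 0] + true_beta[labels, 1] * x)
    return x, rng.poisson(lam).astype(float), labels


if __name__ == "__main__":
    x, y, labels = simulate()
    fit = fit_cwm(x, y, G=2)
    print("mixing weights:", fit["pi"])
    print("regression coefficients:", fit["beta"])
    print("BIC:", fit["bic"])
```

Calling `fit_cwm` for several values of `G` and comparing the returned `bic` values is one simple way to mimic the component-selection comparison mentioned in the summary; other likelihood-based criteria would replace only the penalty term.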

##### MSC:
- 65C60 Computational problems in statistics (MSC2010)
- 62H30 Classification and discrimination; cluster analysis (statistical aspects)
- 62J12 Generalized linear models (logistic models)
##### Software:
flexmix; UCI-ml; flexCWM; R; MULTIMIX; mclust