
Mixtures of general location model with factor analyzer covariance structure for clustering mixed type data. (English) Zbl 1516.62121

Summary: Cluster analysis is one of the most widely used methods in statistical analysis; it identifies homogeneous subgroups in a heterogeneous population. Because many applications involve mixed continuous and discrete data, ordinary clustering methods such as hierarchical methods, \(k\)-means and model-based methods have been extended to the analysis of mixed data. However, in the available model-based clustering methods, the number of parameters grows as the number of continuous variables increases, and identifying and fitting an appropriate model may become difficult. In this paper, to reduce the number of parameters, a set of parsimonious models is introduced for model-based clustering of mixed continuous (normal) and nominal data. The models in this set are built by using the general location model approach to model the distribution of the mixed variables and by applying a factor analyzer structure to the covariance matrices. The ECM algorithm is used to estimate the parameters of these models. To show the performance of the proposed models for clustering, results from some simulation studies and from the analysis of two real data sets are presented.
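
Illustration (not part of the original summary): the parameter saving from a factor analyzer covariance structure can be sketched with a simple count. The sketch assumes the usual factor analyzer form \(\Sigma = \Lambda\Lambda^\top + \Psi\), with a \(p \times q\) loading matrix \(\Lambda\) and a diagonal \(\Psi\); the summary does not state the exact constraints used in the paper, and the function names below are hypothetical.

```python
def full_cov_params(p: int) -> int:
    """Free parameters in an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2


def factor_cov_params(p: int, q: int) -> int:
    """Free parameters in Sigma = Lambda Lambda^T + Psi (assumed form).

    Lambda is p x q, Psi is diagonal (p entries). The term q(q-1)/2 is
    subtracted because Lambda is only identified up to a rotation.
    """
    if not 0 < q < p:
        raise ValueError("need 0 < q < p")
    return p * q - q * (q - 1) // 2 + p


if __name__ == "__main__":
    # Compare the two counts for a few numbers of continuous variables,
    # holding the number of latent factors fixed at q = 2.
    for p in (5, 10, 20, 50):
        print(p, full_cov_params(p), factor_cov_params(p, q=2))
```

Under these assumptions the unrestricted count grows quadratically in the number of continuous variables \(p\), while the factor analyzer count grows linearly for fixed \(q\). In a mixture, the count applies to each component covariance matrix unless further constraints are shared across components.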

MSC:

62-XX Statistics
