
Mixed Deep Gaussian Mixture Model: a clustering model for mixed datasets. (English) Zbl 07538943

Summary: Clustering mixed data presents numerous challenges inherent to the very heterogeneous nature of the variables. Despite this heterogeneity, a clustering algorithm should be able to extract discriminant pieces of information from the variables in order to form groups. In this work we introduce a multilayer architecture model-based clustering method, called the Mixed Deep Gaussian Mixture Model, that can be viewed as an automatic way to merge the clusterings performed separately on continuous and non-continuous data. This architecture is flexible and can be adapted to mixed as well as to purely continuous or non-continuous data. In this sense, it generalizes Generalized Linear Latent Variable Models and Deep Gaussian Mixture Models. We also design a new initialisation strategy and a data-driven method that selects the best specification of the model and the optimal number of clusters for a given dataset. In addition, our model provides continuous low-dimensional representations of the data, which can be a useful tool for visualizing mixed datasets. Finally, we validate the performance of our approach by comparing its results with those of state-of-the-art mixed-data clustering models on several commonly used datasets.
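
To make the layered idea concrete, here is a minimal, hypothetical sketch of such a pipeline in Python: each data block (continuous, and one-hot-encoded categorical) is first embedded into a continuous latent space, the embeddings are merged, and a Gaussian mixture with a BIC-selected number of components is fitted on top. This is not the authors' MDGMM: PCA stands in for the paper's GLLVM/DGMM layers, BIC stands in for their data-driven selection procedure, and all names and dimensions are illustrative.

```python
# Minimal sketch of a "merge-then-cluster" pipeline for mixed data.
# NOT the authors' MDGMM: PCA replaces the GLLVM/DGMM latent layers,
# and BIC replaces the paper's model-selection criterion.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

# Toy mixed dataset: 200 observations, 3 continuous and 2 categorical variables.
X_cont = rng.normal(size=(200, 3))
X_cat = rng.integers(0, 3, size=(200, 2))

# "First layer": embed each block into a continuous latent space.
Z_cont = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X_cont))
X_onehot = OneHotEncoder().fit_transform(X_cat).toarray()
Z_cat = PCA(n_components=2).fit_transform(X_onehot)  # crude GLLVM stand-in

# "Deeper layer": merge the latent views and fit Gaussian mixtures,
# choosing the number of clusters by BIC.
Z = np.hstack([Z_cont, Z_cat])
fits = {k: GaussianMixture(n_components=k, random_state=0).fit(Z)
        for k in range(2, 7)}
best_k = min(fits, key=lambda k: fits[k].bic(Z))
labels = fits[best_k].predict(Z)
print(f"selected k={best_k}, cluster sizes: {np.bincount(labels)}")

# The merged latent matrix Z also plays the role of a continuous
# low-dimensional representation of the mixed data, usable for visualization.
```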

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)

References:

[1] Ahmad, A.; Khan, SS, Survey of state-of-the-art mixed data clustering algorithms, IEEE Access, 7, 31883-31902 (2019) · doi:10.1109/ACCESS.2019.2903568
[2] Akaike H (1998) Information theory and an extension of the maximum likelihood principle. In: Selected papers of Hirotugu Akaike. Springer, Berlin, pp 199-213
[3] Baydin, AG; Pearlmutter, BA; Radul, AA; Siskind, JM, Automatic differentiation in machine learning: a survey, J Mach Learn Res, 18, 1, 5595-5637 (2017) · Zbl 06982909
[4] Biernacki, C.; Celeux, G.; Govaert, G., Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models, Comput Stat Data Anal, 41, 3-4, 561-575 (2003) · Zbl 1429.62235 · doi:10.1016/S0167-9473(02)00163-9
[5] Blalock D, Ortiz JJG, Frankle J, Guttag J (2020) What is the state of neural network pruning? arXiv preprint arXiv:2003.03033
[6] Cagnone, S.; Viroli, C., A factor mixture model for analyzing heterogeneity and cognitive structure of dementia, AStA Adv Stat Anal, 98, 1, 1-20 (2014) · Zbl 1443.62408 · doi:10.1007/s10182-012-0206-5
[7] Chiu T, Fang D, Chen J, Wang Y, Jeris C (2001) A robust and scalable clustering algorithm for mixed type attributes in large database environment. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining, pp 263-268
[8] Dua D, Graff C (2017) UCI machine learning repository. http://archive.ics.uci.edu/ml
[9] Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD-96), pp 226-231
[10] Fraley, C.; Raftery, AE, Model-based clustering, discriminant analysis, and density estimation, J Am Stat Assoc, 97, 458, 611-631 (2002) · Zbl 1073.62545 · doi:10.1198/016214502760047131
[11] Fruehwirth-Schnatter S, Lopes HF (2018) Sparse Bayesian factor analysis when the number of factors is unknown. arXiv preprint arXiv:1804.04231
[12] Ghahramani Z, Hinton GE (1996) The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto
[13] Gower, JC, A general coefficient of similarity and some of its properties, Biometrics, 27, 4, 857-871 (1971) · doi:10.2307/2528823
[14] Huang Z (1997) Clustering large data sets with mixed numeric and categorical values. In: Proceedings of the 1st Pacific-Asia conference on knowledge discovery and data mining (PAKDD), Singapore, pp 21-34
[15] Huang, Z., Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Min Knowl Disc, 2, 3, 283-304 (1998) · doi:10.1023/A:1009769707641
[16] Jogin M, Madhulika M, Divya G, Meghana R, Apoorva S et al (2018) Feature extraction using convolution neural networks (CNN) and deep learning. In: 2018 3rd IEEE international conference on recent trends in electronics, information & communication technology (RTEICT). IEEE, pp 2319-2323
[17] Kohonen, T., The self-organizing map, Proc IEEE, 78, 9, 1464-1480 (1990) · doi:10.1109/5.58325
[18] Maclaurin D, Duvenaud D, Adams RP (2015) Autograd: effortless gradients in NumPy. In: ICML 2015 AutoML workshop, vol 238, p 5
[19] McLachlan, GJ; Peel, D., Finite mixture models. Probability and statistics-applied probability and statistics section (2000), New York: Wiley · Zbl 0963.62061
[20] McLachlan, GJ; Peel, D.; Bean, RW, Modelling high-dimensional data by mixtures of factor analyzers, Comput Stat Data Anal, 41, 3-4, 379-388 (2003) · Zbl 1256.62036 · doi:10.1016/S0167-9473(02)00183-4
[21] Melnykov, V.; Maitra, R., Finite mixture models and model-based clustering, Stat Surv, 4, 80-116 (2010) · Zbl 1190.62121 · doi:10.1214/09-SS053
[22] Moustaki, I., A general class of latent variable models for ordinal manifest variables with covariate effects on the manifest and latent variables, Br J Math Stat Psychol, 56, 2, 337-357 (2003) · doi:10.1348/000711003770480075
[23] Moustaki, I.; Knott, M., Generalized latent trait models, Psychometrika, 65, 3, 391-411 (2000) · Zbl 1291.62236 · doi:10.1007/BF02296153
[24] Nenadic O, Greenacre M (2005) Computation of multiple correspondence analysis, with code in R. Technical report, Universitat Pompeu Fabra · Zbl 1127.62054
[25] Niku, J.; Brooks, W.; Herliansyah, R.; Hui, FK; Taskinen, S.; Warton, DI, Efficient estimation of generalized linear latent variable models, PLoS ONE, 14, 5, 481-497 (2019) · doi:10.1371/journal.pone.0216129
[26] Pagès, J., Multiple factor analysis by example using R (2014), Cambridge: CRC Press · Zbl 1305.62007 · doi:10.1201/b17700
[27] Patil, DD; Wadhai, V.; Gokhale, J., Evaluation of decision tree pruning algorithms for complexity and classification accuracy, Int J Comput Appl, 11, 2, 23-30 (2010)
[28] Philip, G.; Ottaway, B., Mixed data cluster analysis: an illustration using Cypriot hooked-tang weapons, Archaeometry, 25, 2, 119-133 (1983) · doi:10.1111/j.1475-4754.1983.tb00671.x
[29] Rousseeuw, PJ, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J Comput Appl Math, 20, 53-65 (1987) · Zbl 0636.62059 · doi:10.1016/0377-0427(87)90125-7
[30] Schwarz, G., Estimating the dimension of a model, Ann Stat, 6, 2, 461-464 (1978) · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[31] Selosse M, Gormley C, Jacques J, Biernacki C (2020) A bumpy journey: exploring deep Gaussian mixture models. In: “I Can’t Believe It’s Not Better!” NeurIPS 2020 workshop
[32] Viroli, C.; McLachlan, GJ, Deep Gaussian mixture models, Stat Comput, 29, 1, 43-51 (2019) · Zbl 1430.62143 · doi:10.1007/s11222-017-9793-z
[33] Wei GC, Tanner MA (1990) A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J Am Stat Assoc 85(411):699-704
[34] Wold, S.; Sjöström, M.; Eriksson, L., PLS-regression: a basic tool of chemometrics, Chemom Intell Lab Syst, 58, 2, 109-130 (2001) · doi:10.1016/S0169-7439(01)00155-1