Model-based co-clustering for mixed type data. (English) Zbl 07160684

Summary: The importance of clustering for creating groups of observations is well known. The emergence of high-dimensional data sets with a huge number of features leads to co-clustering techniques, and several methods have been developed for simultaneously producing groups of observations and features. By grouping the data set into blocks (the crossing of a row-cluster and a column-cluster), these techniques can sometimes better summarize the data set and its inherent structure. The Latent Block Model (LBM) is a well-known method for performing co-clustering. However, recently, contexts with features of different types (here called mixed type data sets) are becoming more common. The LBM is not directly applicable to this kind of data set. Here a natural extension of the usual LBM to the "Multiple Latent Block Model" (MLBM) is proposed in order to handle mixed type data sets. Inference is performed using a Stochastic EM-algorithm that embeds a Gibbs sampler, and allows for missing data situations. A model selection criterion is defined to choose the number of row and column clusters. The method is then applied to both simulated and real data sets.
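
To make the notion of a block concrete, the following minimal Python sketch summarizes a toy mixed type data set by the mean of each block, given row and column partitions. It is only an illustration and not the inference procedure of the paper: the partitions are supplied by hand rather than estimated, the function name `block_means` and the toy data are hypothetical, and the layout (one row partition shared by all feature types, one column partition per type) is an assumption about how such a model is organized. Missing entries are encoded as `None` and skipped.

```python
from collections import defaultdict


def block_means(matrix, row_labels, col_labels):
    """Average the observed entries of each block (row-cluster, column-cluster)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is None:  # missing entry: leave it out of the summary
                continue
            key = (row_labels[i], col_labels[j])
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy mixed type data: 4 observations, 2 continuous and 2 binary features.
continuous = [[1.0, 1.2], [0.8, None], [5.0, 5.2], [4.8, 5.0]]
binary = [[1, 0], [1, 0], [0, 1], [0, 1]]

# Assumed layout: a single row partition shared by both feature types,
# and a separate column partition for each type.
row_labels = [0, 0, 1, 1]
cont_col_labels = [0, 0]
bin_col_labels = [0, 1]

cont_summary = block_means(continuous, row_labels, cont_col_labels)
bin_summary = block_means(binary, row_labels, bin_col_labels)

for name, summary in (("continuous", cont_summary), ("binary", bin_summary)):
    for (k, l), mean in sorted(summary.items()):
        print(f"{name} block (row {k}, col {l}): mean = {mean:.2f}")
```

In a model-based approach the block means would be replaced by the parameters of a distribution suited to each feature type, and the partitions would be inferred jointly with those parameters.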

MSC:

62-XX Statistics
