zbMATH — the first resource for mathematics

Clustering and variable selection for categorical multivariate data. (English) Zbl 1349.62259
Summary: This article investigates unsupervised classification techniques for categorical multivariate data, using multivariate multinomial mixture models, a model class particularly suited to multilocus genotypic data. A model selection procedure simultaneously selects the number of mixture components and the relevant variables. A non-asymptotic oracle inequality is obtained, leading to a new penalized maximum likelihood criterion. Under weak assumptions on the true probability distribution underlying the observations, the selected model is shown to be asymptotically consistent. The main theoretical result yields a penalty function defined up to a multiplicative constant; in practice, this constant is calibrated from the data by slope heuristics. On simulated data, the calibrated procedure improves on classical criteria such as BIC and AIC, providing an answer to the question “Which criterion for which sample size?” Applications to real datasets are also presented.
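The selection scheme described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it fits latent class models (mixtures of products of multinomials) by EM for a range of component numbers K, computes the free-parameter count D(K) = (K−1) + K·Σⱼ(mⱼ−1), estimates the slope of the maximized log-likelihood against D(K) in the overfitting regime, and applies the slope-heuristic penalty pen(K) = 2·ŝ·D(K). All function and variable names are illustrative, and variable selection is omitted for brevity.

```python
import numpy as np

def em_latent_class(X, K, n_cats, n_iter=150, seed=0):
    """EM for a mixture of products of multinomials (latent class model).
    X is (n, p) integer-coded; n_cats[j] is the number of levels of variable j.
    Returns the maximised log-likelihood."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(K, 1.0 / K)
    theta = [rng.dirichlet(np.ones(m), size=K) for m in n_cats]  # each (K, m_j)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] ∝ pi_k * prod_j theta_j[k, X[i, j]]
        log_r = np.log(pi)[None, :] + sum(np.log(theta[j])[:, X[:, j]].T
                                          for j in range(p))
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted frequencies (tiny smoothing avoids log 0)
        pi = r.mean(axis=0)
        for j, m in enumerate(n_cats):
            counts = np.stack([r[X[:, j] == c].sum(axis=0)
                               for c in range(m)], axis=1) + 1e-6
            theta[j] = counts / counts.sum(axis=1, keepdims=True)
    log_p = np.log(pi)[None, :] + sum(np.log(theta[j])[:, X[:, j]].T
                                      for j in range(p))
    return np.logaddexp.reduce(log_p, axis=1).sum()

# Simulated data: two well-separated components, 6 ternary variables.
rng = np.random.default_rng(1)
n, p, m = 300, 6, 3
z = rng.integers(0, 2, size=n)
probs = np.array([[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
X = np.stack([np.array([rng.choice(m, p=probs[zi]) for zi in z])
              for _ in range(p)], axis=1)

n_cats = [m] * p
Ks = [1, 2, 3, 4, 5]
# Best log-likelihood over a few restarts, and model dimension D(K)
lls = {K: max(em_latent_class(X, K, n_cats, seed=s) for s in range(3))
       for K in Ks}
for a, b in zip(Ks, Ks[1:]):
    lls[b] = max(lls[b], lls[a])  # nested models: ll is nondecreasing in K
dims = {K: (K - 1) + K * sum(c - 1 for c in n_cats) for K in Ks}

# Slope heuristic: in the overfitting regime the log-likelihood grows roughly
# linearly in D; estimate that slope and use pen(K) = 2 * slope * D(K).
slope = np.polyfit([dims[K] for K in Ks[2:]], [lls[K] for K in Ks[2:]], 1)[0]
kappa = 2.0 * max(slope, 1e-8)
crit = {K: lls[K] - kappa * dims[K] for K in Ks}
best_K = max(crit, key=crit.get)
print(best_K)
```

With a strong two-component signal, the data-driven penalty typically recovers K = 2, whereas an uncalibrated (too small or too large) multiplicative constant would over- or under-select, which is the motivation for the slope heuristics advocated in the paper.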

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62F07 Statistical ranking and selection procedures
Full Text: DOI Euclid arXiv
[1] Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. J. Mach. Learn. Res. 10 245-279.
[2] Asuncion, A. and Newman, D. J. (2007). UCI Machine Learning Repository.
[3] Bai, Z., Rao, C. R. and Wu, Y. (1999). Model selection with data-oriented penalty. J. Statist. Plann. Inference 77 102-117. · Zbl 0926.62045
[4] Biernacki, C., Celeux, G. and Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22 719-725.
[5] Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Related Fields 138 33-73. · Zbl 1112.62082
[6] Celeux, G. and Govaert, G. (1991). Clustering criteria for discrete data and latent class models. J. Classif. 8 157-176. · Zbl 0775.62150
[7] Celeux, G., Hurn, M. and Robert, C. P. (2000). Computational and inferential difficulties with mixture posterior distributions. J. Am. Stat. Assoc. 95 957-970. · Zbl 0999.62020
[8] Chen, C., Forbes, F. and François, O. (2006). FASTRUCT: Model-based clustering made faster. Molecular Ecology Notes 6 980-983.
[9] Collins, L. M. and Lanza, S. T. (2010). Latent Class and Latent Transition Analysis: With Applications in the Social, Behavioral, and Health Sciences . Wiley Series in Probability and Statistics . Wiley.
[10] Corander, J., Marttinen, P., Sirén, J. and Tang, J. (2008). Enhanced Bayesian modelling in BAPS software for learning genetic structures of populations. BMC Bioinformatics 9 539.
[11] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1-38. · Zbl 0364.62022
[12] Genovese, C. R. and Wasserman, L. (2000). Rates of convergence for the Gaussian mixture sieve. Ann. Statist. 28 1105-1127. · Zbl 1105.62333
[13] Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable and unidentifiable models. Biometrika 61 215-231. · Zbl 0281.62057
[14] Latch, E. K., Dharmarajan, G., Glaubitz, J. C. and Rhodes, O. E. Jr. (2006). Relative performance of Bayesian clustering software for inferring population substructure and individual assignment at low levels of population differentiation. Conservation Genetics 7 295.
[15] Lebarbier, É. (2002). Quelques approches pour la détection de rupture à horizon fini. PhD thesis, Univ. Paris-Sud, F-91405 Orsay.
[16] Massart, P. (2007). Concentration inequalities and model selection . Lecture Notes in Mathematics 1896 . Springer-Verlag, Berlin. · Zbl 1170.60006
[17] Maugis, C. and Michel, B. (2011a). A non asymptotic penalized criterion for Gaussian mixture model selection. ESAIM: P&S 15 41-68. · Zbl 1395.62162
[18] Maugis, C. and Michel, B. (2011b). Data-driven penalty calibration: A case study for Gaussian mixture model selection. ESAIM: P&S 15 320-339. · Zbl 1395.62163
[19] McCutcheon, A. L. (1987). Latent Class Analysis . Quantitative Applications in the Social Sciences 64 . Sage Publications, Thousand Oaks, California.
[20] McLachlan, G. and Peel, D. (2000). Finite Mixture Models . Wiley Series in Probability and Statistics . Wiley. · Zbl 0963.62061
[21] Nadif, M. and Govaert, G. (1998). Clustering for binary data and mixture models - choice of the model. Appl. Stoch. Models Data Anal. 13 269-278. · Zbl 0910.62021
[22] Pritchard, J. K., Stephens, M. and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics 155 945-959.
[23] Rigouste, L., Cappé, O. and Yvon, F. (2006). Inference and evaluation of the multinomial mixture model for text clustering. Inform. Process. Manag. 43 1260-1280.
[24] Rosenberg, N. A., Burke, T., Elo, K., Feldman, M. W., Freidlin, P. J., Groenen, M. A. M., Hillel, J., Ma, A., Vignal, A., Wimmers, K. and Weigend, S. (2001). Empirical evaluation of genetic clustering methods using multilocus genotypes from 20 chicken breeds. Genetics 159 699-713.
[25] Toussile, W. and Gassiat, E. (2009). Variable selection in model-based clustering using multilocus genotype data. Adv. Data Anal. Classif. 3 109-134. · Zbl 1284.62397
[26] Verzelen, N. (2009). Adaptive estimation of regular Gaussian Markov random fields. PhD thesis, Univ. Paris-Sud.
[27] Villers, F. (2007). Tests et sélection de modèles pour l’analyse de données protéomiques et transcriptomiques. PhD thesis, Univ. Paris-Sud.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.