
Better than the best? Answers via model ensemble in density-based clustering. (English) Zbl 07433032

Summary: With the recent growth in data availability and complexity, and the associated outburst of elaborate modelling approaches, model selection tools have become a lifeline, providing objective criteria to navigate this increasingly challenging landscape. In fact, basing predictions and inference on a single model may be limiting, if not harmful; ensemble approaches, which combine different models, have been proposed to overcome the selection step and have proven fruitful, especially in the supervised learning framework. Conversely, these approaches have been only scantily explored in the unsupervised setting. In this work we focus on the model-based clustering formulation, where a plethora of mixture models, with different numbers of components and parametrizations, is typically estimated. We propose an ensemble clustering approach that circumvents the single-best-model paradigm while improving the stability and robustness of the partitions. A new density estimator, defined as a convex linear combination of the density estimates in the ensemble, is introduced and exploited for group assignment. As opposed to the standard case, where clusters are typically associated with the components of the selected mixture model, we define partitions by borrowing the modal, or nonparametric, formulation of the clustering problem, where groups are linked with high-density regions. Staying in the density-based realm, we thus show how blending parametric and nonparametric approaches may be beneficial from a clustering perspective.
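The two ingredients described in the summary — a convex combination of candidate mixture densities, and a modal assignment that links points to high-density regions — can be illustrated with a toy sketch. Everything below is an illustrative assumption, not the paper's actual algorithm: the two candidate densities are fixed rather than estimated, the convex weights are hand-picked (in practice they could be derived from, e.g., BIC scores), and mode-seeking is done by a simple nearest-higher-density-neighbour climb on a one-dimensional sample.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated groups in one dimension.
X = np.concatenate([rng.normal(-3, 0.5, 150), rng.normal(3, 0.5, 150)])

def gauss(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Two candidate mixture densities; in the actual approach each would be a
# mixture model estimated from the data under a different parametrization.
def f1(x):
    return 0.5 * gauss(x, -3, 0.5) + 0.5 * gauss(x, 3, 0.5)

def f2(x):  # same means, wider components
    return 0.5 * gauss(x, -3, 0.8) + 0.5 * gauss(x, 3, 0.8)

# Convex weights (assumed fixed here for illustration).
w = np.array([0.7, 0.3])
dens = w[0] * f1(X) + w[1] * f2(X)  # ensemble density at the sample points

# Modal assignment: sort the sample and let every point link to the adjacent
# point with the highest ensemble density; following these links partitions
# the sample into basins of attraction of the density modes.
order = np.argsort(X)
d = dens[order]
n = len(X)
parent = np.arange(n)
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n and d[j] > d[parent[i]]:
            parent[i] = j

def climb(i):  # follow links until a local mode (a fixed point) is reached
    while parent[i] != i:
        i = parent[i]
    return i

labels = np.array([climb(i) for i in range(n)])
print("modal groups:", len(np.unique(labels)))  # → modal groups: 2
```

Because groups are defined by the modes of the blended density rather than by the components of any single mixture, the two candidate models need not agree on the number of components for the partition to be stable.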

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H99 Multivariate analysis

Software:

EMMIXskew; R; mclust; ks

References:

[1] Aghaeepour, N.; Finak, G.; Hoos, H.; Mosmann, T.; Brinkman, R.; Gottardo, R.; Scheuermann, R.; FlowCAP Consortium, DREAM Consortium, Critical assessment of automated flow cytometry data analysis techniques, Nat Methods, 10, 3, 228 (2013) · doi:10.1038/nmeth.2365
[2] Azzalini, A.; Dalla Valle, A., The multivariate skew-normal distribution, Biometrika, 83, 4, 715-726 (1996) · Zbl 0885.62062 · doi:10.1093/biomet/83.4.715
[3] Banfield, J.; Raftery, AE, Model-based Gaussian and non-Gaussian clustering, Biometrics, 49, 3, 803-821 (1993) · Zbl 0794.62034
[4] Baudry, JP; Raftery, AE; Celeux, G.; Lo, K.; Gottardo, R., Combining mixture components for clustering, J Comput Graph Stat, 19, 2, 332-353 (2010) · doi:10.1198/jcgs.2010.08111
[5] Biernacki, C.; Celeux, G.; Govaert, G., Assessing a mixture model for clustering with the integrated completed likelihood, IEEE Trans Pattern Anal, 22, 7, 719-725 (2000) · doi:10.1109/34.865189
[6] Celeux, G.; Govaert, G., Gaussian parsimonious clustering models, Pattern Recognit, 28, 5, 781-793 (1995) · doi:10.1016/0031-3203(94)00125-6
[7] Chacón, JE, Mixture model modal clustering, Adv Data Anal Classif, 13, 2, 379-404 (2019) · Zbl 1474.62218 · doi:10.1007/s11634-018-0308-3
[8] Chacón, JE; Duong, T., Multivariate kernel smoothing and its applications (2018), London: Chapman and Hall/CRC, London · Zbl 1402.62003 · doi:10.1201/9780429485572
[9] Cheng, Y., Mean shift, mode seeking, and clustering, IEEE Trans Pattern Anal, 17, 8, 790-799 (1995) · doi:10.1109/34.400568
[10] Claeskens, G.; Hjort, N., Model selection and model averaging (2008), Cambridge: Cambridge University Press, Cambridge · Zbl 1166.62001
[11] Dempster, A.; Laird, N.; Rubin, D., Maximum likelihood from incomplete data via the EM algorithm, J R Stat Soc Ser B Stat Methodol, 39, 1, 1-22 (1977) · Zbl 0364.62022
[12] Dietterich, T., An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization, Mach Learn, 40, 2, 139-157 (2000) · doi:10.1023/A:1007607513941
[13] Duong T (2019) ks: Kernel Smoothing. R package version 1.11.4. https://CRAN.R-project.org/package=ks. Accessed Aug 2019
[14] Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th international conference on machine learning, pp 186-193
[15] Fisher, R., The use of multiple measurements in taxonomic problems, Ann Eugen, 7, 2, 179-188 (1936) · doi:10.1111/j.1469-1809.1936.tb02137.x
[16] Forina, M.; Armanino, C.; Castino, M.; Ubigli, M., Multivariate data analysis as a discriminating method of the origin of wines, Vitis, 25, 3, 189-201 (1986)
[17] Fraley, C.; Raftery, AE, Model-based clustering, discriminant analysis, and density estimation, J Am Stat Assoc, 97, 458, 611-631 (2002) · Zbl 1073.62545 · doi:10.1198/016214502760047131
[18] Friedman, J.; Hastie, T.; Tibshirani, R., The elements of statistical learning (2001), New York: Springer, New York · Zbl 0973.62007
[19] Fukunaga, K.; Hostetler, L., The estimation of the gradient of a density function, with applications in pattern recognition, IEEE Trans Inform Theory, 21, 1, 32-40 (1975) · Zbl 0297.62025 · doi:10.1109/TIT.1975.1055330
[20] Glodek, M.; Schels, M.; Schwenker, F., Ensemble Gaussian mixture models for probability density estimation, Comput Stat, 28, 1, 127-138 (2013) · Zbl 1305.65039 · doi:10.1007/s00180-012-0374-5
[21] Hennig, C., Methods for merging Gaussian mixture components, Adv Data Anal Classif, 4, 1, 3-34 (2010) · Zbl 1306.62141 · doi:10.1007/s11634-010-0058-3
[22] Hubert, L.; Arabie, P., Comparing partitions, J Classif, 2, 1, 193-218 (1985) · doi:10.1007/BF01908075
[23] Kuncheva L, Hadjitodorov S (2004) Using diversity in cluster ensembles. In: 2004 IEEE international conference on systems, man and cybernetics, vol 2. IEEE, pp 1214-1219
[24] Leeb, H.; Pötscher, B., Model selection and inference: facts and fiction, Econom Theory, 21, 1, 21-59 (2005) · Zbl 1085.62004 · doi:10.1017/S0266466605050036
[25] Li, J., Clustering based on a multilayer mixture model, J Comput Graph Stat, 14, 3, 547-568 (2005) · doi:10.1198/106186005X59586
[26] Li, J.; Ray, S.; Lindsay, B., A nonparametric statistical approach to clustering via mode identification, J Mach Learn Res, 8, 1687-1723 (2007) · Zbl 1222.62076
[27] Madigan, D.; Raftery, AE, Model selection and accounting for model uncertainty in graphical models using Occam’s window, J Am Stat Assoc, 89, 428, 1535-1546 (1994) · Zbl 0814.62030 · doi:10.1080/01621459.1994.10476894
[28] Malsiner-Walli, G.; Frühwirth-Schnatter, S.; Grün, B., Identifying mixtures of mixtures using Bayesian estimation, J Comput Graph Stat, 26, 2, 285-295 (2017) · doi:10.1080/10618600.2016.1200472
[29] Menardi, G., A review on modal clustering, Int Stat Rev, 84, 3, 413-433 (2016) · Zbl 07763532 · doi:10.1111/insr.12109
[30] Monti, S.; Tamayo, P.; Mesirov, J.; Golub, T., Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data, Mach Learn, 52, 1-2, 91-118 (2003) · Zbl 1039.68103 · doi:10.1023/A:1023949509487
[31] R Core Team (2019) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. Accessed Aug 2019
[32] Rigollet, P.; Tsybakov, A., Linear and convex aggregation of density estimators, Math Methods Stat, 16, 3, 260-280 (2007) · Zbl 1231.62057 · doi:10.3103/S1066530707030052
[33] Russell N, Murphy TB, Raftery AE (2015) Bayesian model averaging in model-based clustering and density estimation. arXiv preprint arXiv:1506.09035
[34] Schwarz, G., Estimating the dimension of a model, Ann Stat, 6, 2, 461-464 (1978) · Zbl 0379.62005 · doi:10.1214/aos/1176344136
[35] Scott, D., Multivariate density estimation: theory, practice, and visualization (2015), New York: Wiley, New York · Zbl 1311.62004 · doi:10.1002/9781118575574
[36] Scrucca, L., Identifying connected components in Gaussian finite mixture models for clustering, Comput Stat Data Anal, 93, 5-17 (2016) · Zbl 1468.62174 · doi:10.1016/j.csda.2015.01.006
[37] Scrucca L (2020) A fast and efficient modal EM algorithm for Gaussian mixtures. arXiv preprint arXiv:2002.03600
[38] Scrucca, L.; Raftery, AE, Improved initialisation of model-based clustering using Gaussian hierarchical partitions, Adv Data Anal Classif, 9, 4, 447-460 (2015) · Zbl 1414.62272 · doi:10.1007/s11634-015-0220-z
[39] Scrucca, L.; Fop, M.; Murphy, TB; Raftery, AE, mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, R J, 8, 1, 289 (2016) · doi:10.32614/RJ-2016-021
[40] Smyth, P.; Wolpert, D., Linearly combining density estimators via stacking, Mach Learn, 36, 1-2, 59-83 (1999) · doi:10.1023/A:1007511322260
[41] Spidlen, J.; Breuer, K.; Rosenberg, C.; Kotecha, N.; Brinkman, R., FlowRepository: a resource of annotated flow cytometry datasets associated with peer-reviewed publications, Cytom Part A, 81, 9, 727-731 (2012) · doi:10.1002/cyto.a.22106
[42] Strehl, A.; Ghosh, J., Cluster ensembles—a knowledge reuse framework for combining multiple partitions, J Mach Learn Res, 3, 583-617 (2002) · Zbl 1084.68759
[43] Stuetzle, W., Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample, J Classif, 20, 1, 25-47 (2003) · Zbl 1055.62075 · doi:10.1007/s00357-003-0004-6
[44] Tibshirani, R.; Wainwright, M.; Hastie, T., Statistical learning with sparsity: the lasso and generalizations (2015), London: Chapman and Hall, London · Zbl 1319.68003
[45] Viroli, C.; McLachlan, G., Deep Gaussian mixture models, Stat Comput, 29, 1, 43-51 (2019) · Zbl 1430.62143 · doi:10.1007/s11222-017-9793-z
[46] Wang K, Ng A, McLachlan G (2018) EMMIXskew: the EM algorithm and skew mixture distribution. https://CRAN.R-project.org/package=EMMIXskew. R package version 1.0.3
[47] Wei, Y.; McNicholas, PD, Mixture model averaging for clustering, Adv Data Anal Classif, 9, 2, 197-217 (2015) · Zbl 1414.62283 · doi:10.1007/s11634-014-0182-6