×

zbMATH — the first resource for mathematics

Clustering with the average silhouette width. (English) Zbl 07345453
Summary: The Average Silhouette Width (ASW) is a popular cluster validation index to estimate the number of clusters. The question whether it also is suitable as a general objective function to be optimized for finding a clustering is addressed. Two algorithms (the standard version OSil and a fast version FOSil) are proposed, and they are compared with existing clustering methods in an extensive simulation study covering known and unknown numbers of clusters. Real data sets are analysed, partly exploring the use of the new methods with non-Euclidean distances. The ASW is shown to satisfy some axioms that have been proposed for cluster quality functions. The new methods prove useful and sensible in many cases, but some weaknesses are also highlighted. These also concern the use of the ASW for estimating the number of clusters together with other methods, which is of general interest due to the popularity of the ASW for this task.
MSC:
62-XX Statistics
PDF BibTeX XML Cite
Full Text: DOI
References:
[1] Ackerman, M.; Ben-David, S., Measures of clustering quality: A working set of axioms for clustering, (Advances in Neural Information Processing Systems (2009)), 121-128
[2] Arbelaitz, O.; Gurrutxaga, I.; Muguerza, J.; Perez, J. M.; Perona, I., An extensive comparative study of cluster validity indices, Pattern Recognit., 46, 243-256 (2012)
[3] Batool, F., Initialization methods for optimum average silhouette width clustering (2020), arXiv preprint arXiv:1910.08644
[4] Bernard, E.; Naveau, P.; Vrac, M.; Mestre, O., Clustering of maxima: Spatial dependencies among heavy rainfall in france, J. Clim., 26, 7929-7937 (2013)
[5] Biase, F. H.; Cao, X.; Zhong, S., Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell rna sequencing, Genome Res., 24, 1787-1796 (2014)
[6] Caliński, T.; Harabasz, J., A dendrite method for cluster analysis, Comm. Statist. Theory Methods, 3, 1-27 (1974) · Zbl 0273.62010
[7] Carlsson, G.; Mémoli, F., Classifying clustering schemes, Found. Comput. Math., 13, 221-252 (2013) · Zbl 1358.62057
[8] Correa-Morris, J., An indication of unification for different clustering approaches, Pattern Recognit., 46, 2548-2561 (2013) · Zbl 1323.68431
[9] Fisher, L.; Ness, J. W.V., Admissible clustering procedures, Biometrika, 58, 91-104 (1971) · Zbl 0224.62030
[10] Fraley, C.; Raftery, A. E., How many clusters? which clustering method? answers via model-based cluster analysis, Comput. J., 41, 578-588 (1998) · Zbl 0920.68038
[11] Goolam, M.; Scialdone, A.; Graham, S. J.; Macaulay, I. C.; Jedrusik, A.; Hupalowska, A.; Voet, T.; Marioni, J. C.; Zernicka-Goetz, M., Heterogeneity in oct4 and sox2 targets biases cell fate in 4-cell mouse embryos, Cell, 165, 61-74 (2016)
[12] Halkidi, M.; Vazirgiannis, M.; Hennig, C., Method-independent indices for cluster validation and estimating the number of clusters, (Hennig, C.; Meila, M.; Murtagh, F.; Rocci, R., Handbook of Cluster Analysis (2015), CRC Press), 595-618 · Zbl 1396.62136
[13] Hartigan, J. A.; Wong, M. A., Algorithm as 136: A k-means clustering algorithm, J. R. Stat. Soc. Ser. C. Appl. Stat., 28, 100-108 (1979) · Zbl 0447.62062
[14] Hennig, C., Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods, J. Multivariate Anal., 99, 1154-1176 (2008) · Zbl 1141.62052
[15] Hennig, C., What are the true clusters?, Pattern Recognit. Lett., 64, 53-62 (2015)
[16] Hennig, C.; Lin, C. J., Flexible parametric bootstrap for testing homogeneity against clustering and assessing the number of clusters, Stat. Comput., 25, 821-833 (2015) · Zbl 1331.62308
[17] Hennig, C.; Meila, M.; Murtagh, F.; Rocci, R., Handbook of Cluster Analysis (2015), CRC Press: CRC Press Boca Raton, USA
[18] Hubert, L.; Arabie, P., Comparing partitions, J. Classification, 2, 193-218 (1985)
[19] Jardine, N.; Sibson, R., The construction of hierarchic and non-hierarchic classifications, Comput. J., 11, 177-184 (1968) · Zbl 0164.46207
[20] Kaufman, L.; Rousseeuw, P., Clustering by Means of Medoids (1987), North-Holland: North-Holland Amsterdam
[21] Kaufman, L.; Rousseeuw, P. J., Finding Groups in Data: An Introduction to Cluster Analysis. Volume, 344 (1990), John Wiley & Sons
[22] Kleinberg, J. M., An impossibility theorem for clustering, (Advances in Neural Information Processing Systems (2003)), 463-470
[23] Kolodziejczyk, A. A.; Kim, J. K.; Tsang, J. C.; Ilicic, T.; Henriksson, J.; Natarajan, K. N.; Tuck, A. C.; Gao, X.; Bühler, M.; Liu, P., Single cell rna-sequencing of pluripotent states unlocks modular transcriptional variation, Cell Stem Cell, 17, 471-485 (2015)
[24] Van der Laan, M.; Pollard, K.; Bryan, J., A new partitioning around medoids algorithm, J. Stat. Comput. Simul., 73, 575-584 (2003) · Zbl 1054.62075
[25] Lloyd, S., Least squares quantization in pcm, IEEE Trans. Inform. Theory, 28, 129-137 (1982) · Zbl 0504.94015
[26] Lun, A. T.; McCarthy, D. J.; Marioni, J. C., A Step-By-Step Workflow for Low-Level Analysis of Single-Cell Rna-Seq Data with Bioconductor, 5 (2016), F1000Research
[27] Maechler, M.; Rousseeuw, P.; Struyf, A.; Hubert, M.; Hornik, K., Cluster: Cluster analysis basics and extensions. R package version 2.0.6 (2017)
[28] Martinez-Ortega, M. M.; Delgado, L.; Albach, D. C.; Elena-Rossello, J. A.; Rico, E., Species boundaries and phylogeographic patterns in cryptic taxa inferred from aflp markers: Veronica subgen. pentasepalae (scrophulariaceae) in the western mediterranean, Syst. Biol., 29, 965-986 (2004)
[29] Ng, A. Y.; Jordan, M. I.; Weiss, Y., On spectral clustering: Analysis and an algorithm, (NIPS (2001)), 849-856
[30] Rousseeuw, P. J., Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math., 20, 53-65 (1987) · Zbl 0636.62059
[31] Rubin, J., Optimal classification into groups: an approach for solving the taxonomy problem, J. Theoret. Biol., 15, 103-144 (1967)
[32] Scrucca, L.; Fop, M.; Murphy, T. B.; Raftery, A. E., Mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, R J., 8, 205-233 (2017)
[33] Shi, G. R., Multivariate data analysis in palaeoecology and palaeobiogeography-a review, Palaeogeogr. Palaeoclimatol. Palaeoecol., 105, 199-234 (1993)
[34] Sokal, R. R.; Michener, C. D., A statistical method for evaluating systematic relationships, Univ. Kansas Sci. Bull., 38, 1409-1438 (1958)
[35] Tibshirani, R.; Walther, G.; Hastie, T., Estimating the number of clusters in a data set via the gap statistic, J. R. Stat. Soc. Ser. B Stat. Methodol., 63, 411-423 (2001) · Zbl 0979.62046
[36] Von Luxburg, U., Williamson, R.C., Guyon, I., 2012. Clustering: Science or art? In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning, pp. 65-79.
[37] Ward, J. H., Hierarchical grouping to optimize an objective function, J. Amer. Statist. Assoc., 58, 236-244 (1963)
[38] Wright, W. E., A formalization of cluster analysis, Pattern Recognit., 5, 273-282 (1973)
[39] Yan, L.; Yang, M.; Guo, H.; Yang, L.; Wu, J.; Li, R.; Liu, P.; Lian, Y.; Zheng, X.; Yan, J., Single-cell rna-seq profiling of human preimplantation embryos and embryonic stem cells, Nat. Struct. Mol. Biol., 20, 1131 (2013)
[40] Zadeh, R. B.; Ben-David, S., A uniqueness theorem for clustering, (Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (2009), AUAI Press), 639-646
[41] Zeileis, A.; Hornik, K.; Smola, A.; Karatzoglou, A., Kernlab – an s4 package for kernel methods in r, J. Stat. Softw., 11, 1-20 (2004)
[42] Zeisel, A.; Muñoz Manchado, A. B.; Codeluppi, S.; Lönnerberg, P.; La Manno, G.; Juréus, A.; Marques, S.; Munguba, H.; He, L.; Betsholtz, C., Cell types in the mouse cortex and hippocampus revealed by single-cell rna-seq, Science, 347, 1138-1142 (2015)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.