zbMATH — the first resource for mathematics

Comparing clusterings and numbers of clusters by aggregation of calibrated clustering validity indexes. (English) Zbl 1452.62430
Summary: A key issue in cluster analysis is the choice of an appropriate clustering method and the determination of the best number of clusters. Different clusterings are optimal on the same data set according to different criteria, and the choice of such criteria depends on the context and aim of clustering. Therefore, researchers need to consider what data analytic characteristics the clusters they are aiming at are supposed to have, among others within-cluster homogeneity, between-clusters separation, and stability. Here, a set of internal clustering validity indexes measuring different aspects of clustering quality is proposed, including some indexes from the literature. Users can choose the indexes that are relevant in the application at hand. In order to measure the overall quality of a clustering (for comparing clusterings from different methods and/or different numbers of clusters), the index values are calibrated for aggregation. Calibration is relative to a set of random clusterings on the same data. Two specific aggregated indexes are proposed and compared with existing indexes on simulated and real data.
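The calibrate-then-aggregate idea in the summary can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the two index definitions (`avg_within_distance`, `min_separation`), uniformly random label assignment as the reference set, z-score standardization as the calibration, and a weighted mean as the aggregation. In particular, the reference clusterings below use the same number of clusters as the clustering being scored, which is a simplification; the summary does not specify the actual scheme.

```python
import numpy as np


def avg_within_distance(X, labels):
    """Mean pairwise distance between points sharing a cluster (lower is better)."""
    total, count = 0.0, 0
    for c in np.unique(labels):
        pts = X[labels == c]
        if len(pts) < 2:
            continue
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        iu = np.triu_indices(len(pts), k=1)
        total += d[iu].sum()
        count += len(iu[0])
    return total / count if count else 0.0


def min_separation(X, labels):
    """Smallest distance between two points in different clusters (higher is better).

    Requires at least two clusters.
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    different = labels[:, None] != labels[None, :]
    return d[different].min()


def raw_indexes(X, labels):
    # Orient both indexes so that larger values mean a better clustering.
    return np.array([-avg_within_distance(X, labels), min_separation(X, labels)])


def random_clustering(n, k, rng):
    """Random labels for n points in which each of the k clusters is non-empty."""
    labels = np.concatenate([np.arange(k), rng.integers(0, k, size=n - k)])
    return rng.permutation(labels)


def aggregated_index(X, labels, n_random=200, weights=None, seed=0):
    """Weighted mean of index values standardized against random clusterings.

    Assumption: calibration = z-score relative to n_random random clusterings
    of the same data with the same number of clusters.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    k = len(np.unique(labels))
    ref = np.array(
        [raw_indexes(X, random_clustering(len(X), k, rng)) for _ in range(n_random)]
    )
    z = (raw_indexes(X, labels) - ref.mean(axis=0)) / ref.std(axis=0)
    w = np.ones(len(z)) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sum(w * z) / np.sum(w))
```

Calling `aggregated_index(X, labels)` for candidate clusterings of the same data gives scores on a common scale, so the candidate with the largest score would be preferred under these assumed indexes. The `weights` argument stands in for the user's choice of which clustering characteristics matter in the application.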
MSC:
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P30 Applications of statistics in engineering and industry; control charts
References:
[1] Arbelaitz, O.; Gurrutxaga, I.; Muguerza, J.; Perez, JM; Perona, I., An extensive comparative study of cluster validity indices, Pattern Recognit., 46, 243-256 (2012)
[2] Caliński, T.; Harabasz, J., A dendrite method for cluster analysis, Commun. Stat. Theory Methods, 3, 1, 1-27 (1974) · Zbl 0273.62010
[3] Charytanowicz, M.; Niewczas, J.; Kulczycki, P.; Kowalski, PA; Łukasik, S.; Żak, S.; Pitka, E.; Kawa, J., Complete gradient clustering algorithm for features analysis of x-ray images, Information Technologies in Biomedicine, 15-24 (2010), Berlin: Springer, Berlin
[4] Delattre, M.; Hansen, P., Bicriterion cluster analysis, IEEE Trans. Pattern Anal. Mach. Intell., 4, 277-291 (1980) · Zbl 0458.62049
[5] Dheeru, D., Karra Taniskidou, E.: UCI machine learning repository (2017). http://archive.ics.uci.edu/ml
[6] Dias, D.B., Madeo, R.C., Rocha, T., Bíscaro, H.H., Peres, S.M.: Hand movement recognition for Brazilian sign language: a study using distance-based neural networks. In: International Joint Conference on Neural Networks, 2009. IJCNN 2009, pp. 697-704. IEEE (2009). 10.1109/IJCNN.2009.5178917
[7] Dunn, JC, Well-separated clusters and optimal fuzzy partitions, J. Cybern., 4, 1, 95-104 (1974) · Zbl 0304.68093
[8] Fang, Y.; Wang, J., Selection of the number of clusters via the bootstrap method, Comput. Stat. Data Anal., 56, 3, 468-477 (2012) · Zbl 1239.62076
[9] Forina, M.; Leardi, R.; Armanino, C.; Lanteri, S.; Conti, P.; Princi, P., Parvus: An extendable package of programs for data exploration, classification and correlation, J. Chemom., 4, 2, 191-193 (1990)
[10] Fraley, C.; Raftery, AE, Model-based clustering, discriminant analysis and density estimation, J. Am. Stat. Assoc., 97, 4, 611-631 (2002) · Zbl 1073.62545
[11] Gelman, A.; Hennig, C., Beyond subjective and objective in statistics, J. R. Stat. Soc.: Ser. A (Stat. Soc.), 180, 4, 967-1033 (2017)
[12] Halkidi, M.; Vazirgiannis, M.; Hennig, C.; Hennig, C.; Meila, M.; Murtagh, F.; Rocci, R., Method-independent indices for cluster validation and estimating the number of clusters, Handbook of Cluster Analysis, 595-618 (2015), Boca Raton: CRC Press, Boca Raton
[13] Handl, J.; Knowles, J.; Hennig, C.; Meila, M.; Murtagh, F.; Rocci, R., Nature-inspired clustering, Handbook of Cluster Analysis, 419-439 (2015), Boca Raton: CRC Press, Boca Raton
[14] Hennig, C., Cluster-wise assessment of cluster stability, Comput. Stat. Data Anal., 52, 258-271 (2007) · Zbl 1452.62447
[15] Hennig, C., What are the true clusters?, Pattern Recognit. Lett., 64, 53-62 (2015)
[16] Hennig, C.; Hennig, C.; Meila, M.; Murtagh, F.; Rocci, R., Clustering strategy and method selection, Handbook of Cluster Analysis, 703-730 (2015), Boca Raton: CRC Press, Boca Raton
[17] Hennig, C.; Skiadas, CH; Bozeman, JR, Cluster validation by measurement of clustering characteristics relevant to the user, Data Analysis and Applications 1: Clustering and Regression. Modeling—Estimating, Forecasting and Data Mining, 1-24 (2019), London: ISTE Ltd., London
[18] Hennig, C.; Liao, TF, How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification (with discussion), J. Roy. Stat. Soc.: Ser. C (Appl. Stat.), 62, 3, 309-369 (2013)
[19] Hubert, L.; Arabie, P., Comparing partitions, J. Classif., 2, 193-218 (1985)
[20] Hubert, L.; Schultz, J., Quadratic assignment as a general data analysis strategy, Br. J. Math. Stat. Psychol., 29, 2, 190-241 (1976) · Zbl 0356.92027
[21] Jain, AK; Dubes, RC, Algorithms for Clustering Data (1988), Englewood Cliffs: Prentice Hall, Englewood Cliffs
[22] Kaufman, L.; Rousseeuw, PJ, Finding Groups in Data: An Introduction to Cluster Analysis (1990), New York: Wiley, New York
[23] Leisch, F., A toolbox for k-centroids cluster analysis, Comput. Stat. Data Anal., 51, 2, 526-544 (2006) · Zbl 1157.62439
[24] Liu, Y.; Li, Z.; Xiong, H.; Gao, X.; Wu, J.; Wu, S., Understanding and enhancement of internal clustering validation measures, IEEE Trans. Cybern., 43, 3, 982-994 (2013)
[25] Lloyd, S., Least squares quantization in PCM, IEEE Trans. Inf. Theory, 28, 2, 129-137 (1982) · Zbl 0504.94015
[26] Milligan, G.; Cooper, M., An examination of procedures for determining the number of clusters in a data set, Psychometrika, 50, 3, 159-179 (1985)
[27] Seber, GAF, Multivariate Observations (1983), New York: Wiley, New York
[28] Shannon, CE, A mathematical theory of communication, Bell Syst. Tech. J., 27, 3, 379-423 (1948) · Zbl 1154.94303
[29] Tibshirani, R.; Walther, G., Cluster validation by prediction strength, J. Comput. Graph. Stat., 14, 3, 511-528 (2005)
[30] Walesiak, M., Dudek, A.: clusterSim package (2011). https://cran.r-project.org/web/packages/clusterSim/
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.