A new nonparametric interpoint distance-based measure for assessment of clustering. (English) Zbl 07498025

Summary: A new interpoint distance-based measure is proposed to identify the optimal number of clusters present in a data set. Being nonparametric by design, it does not depend on the distribution of the given data. Because it relies only on interpoint distances between the data members, the proposed cluster validity index applies to univariate and multivariate data measured on arbitrary scales, and to observations in a space of any dimension, including cases where the number of study variables exceeds the sample size. The criterion is compatible with any clustering algorithm and can be used either to determine the unknown number of clusters or to assess the quality of the resulting clusters for a data set. Demonstrations on synthetic and real-life data establish its superiority over well-known clustering accuracy measures in the literature.
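
The general workflow the summary describes (score candidate partitions using only interpoint distances, then keep the number of clusters with the best score) can be sketched as follows. This is only an illustration under stated assumptions: the record does not define the proposed measure, so the sketch substitutes the silhouette width of Rousseeuw (reference [22]) as a stand-in distance-based index. The function names, the toy points, and the hand-written candidate labelings are all hypothetical; in practice the labelings would come from any clustering algorithm.

```python
import math

def pairwise_distances(points):
    """Euclidean interpoint distance matrix; any other dissimilarity could be used."""
    return [[math.dist(p, q) for q in points] for p in points]

def mean_silhouette(D, labels):
    """Mean silhouette width from a distance matrix D and cluster labels.

    Stand-in index only: this is NOT the measure proposed in the paper.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(D[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(D[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)

def best_k(D, labelings):
    """Return the k whose labeling has the largest index value, plus all scores."""
    scores = {k: mean_silhouette(D, labels) for k, labels in labelings.items()}
    return max(scores, key=scores.get), scores

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
D = pairwise_distances(points)

# Hypothetical candidate partitions, as if produced by some clustering algorithm.
labelings = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],
}

chosen_k, scores = best_k(D, labelings)
print(chosen_k, scores)
```

Since the index only reads the matrix `D`, the same code applies unchanged to data of any dimension, which is the property the summary attributes to distance-based criteria.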

MSC:

62-XX Statistics

References:

[1] Jain, AK; Murty, MN; Flynn, PJ., Data clustering: a review, ACM Comput Surv, 31, 264-323 (1999)
[2] McLachlan, G.; Peel, D., Finite mixture models (2000), New York (NY): John Wiley & Sons · Zbl 0963.62061
[3] Kaufman, L.; Rousseeuw, PJ., Finding groups in data: an introduction to cluster analysis (2005), New Jersey: John Wiley & Sons
[4] Cheng, D.; Zhu, Q.; Huang, J., Natural neighbor-based clustering algorithm with local representatives, Knowl Based Syst, 123, 238-253 (2017)
[5] Cheng, D.; Zhu, Q.; Huang, J., A local cores-based hierarchical clustering algorithm for data sets with complex structures, Neural Comput Appl, 31, 8051-8068 (2018)
[6] Cheng, D.; Zhu, Q.; Huang, J., Clustering with local density peaks-based minimum spanning tree, IEEE Trans Knowl Data Eng, 33, 374-387 (2021)
[7] Matioli, LC; Santos, SR; Kleina, M., A new algorithm for clustering based on kernel density estimation, J Appl Stat, 45, 347-366 (2018) · Zbl 07282433
[8] Modak, S.; Chattopadhyay, AK; Chattopadhyay, T., Clustering of gamma-ray bursts through kernel principal component analysis, Commun Stat - Simul Comput, 47, 1088-1102 (2018)
[9] Modak, S.; Chattopadhyay, T.; Chattopadhyay, AK., Unsupervised classification of eclipsing binary light curves through k-medoids clustering, J Appl Stat, 47, 376-392 (2020) · Zbl 07481420
[10] Modak, S.; Chattopadhyay, AK; Chattopadhyay, T., Clustering of eclipsing binary light curves through functional principal component analysis, submitted for publication (2021)
[11] Tarnopolski, M., Analysis of the duration-hardness ratio plane of gamma-ray bursts using skewed distributions, Astrophys J, 870, 105 (2019)
[12] Tóth, BG; Rácz, II; Horváth, I., Gaussian-mixture-model-based cluster analysis of gamma-ray bursts in the BATSE catalog, Mon Not R Astron Soc, 486, 4823-4828 (2019)
[13] Schwarz, G., Estimating the dimension of a model, Ann Stat, 6, 461-464 (1978) · Zbl 0379.62005
[14] Kass, RE; Raftery, AE., Bayes factors, J Am Stat Assoc, 90, 773-795 (1995) · Zbl 0846.62028
[15] Fraley, C.; Raftery, AE., How many clusters? Which clustering method? Answers via model-based cluster analysis, Comput J, 41, 578-588 (1998) · Zbl 0920.68038
[16] Sugar, CA; James, GM., Finding the number of clusters in a dataset, J Am Stat Assoc, 98, 750-763 (2003) · Zbl 1046.62064
[17] Tibshirani, R.; Walther, G.; Hastie, T., Estimating the number of clusters in a data set via the gap statistic, J R Stat Soc Ser B, 63, 411-423 (2001) · Zbl 0979.62046
[18] Dunn, JC., Well-separated clusters and optimal fuzzy partitions, J Cybern, 4, 95-104 (1974) · Zbl 0304.68093
[19] Handl, J.; Knowles, J.; Kell, D., Computational cluster validation in post-genomic data analysis, Bioinformatics, 21, 3201-3212 (2005)
[20] Caliński, T.; Harabasz, J., A dendrite method for cluster analysis, Commun Stat - Theory Methods, 3, 1-27 (1974) · Zbl 0273.62010
[21] Ripley, BD., Pattern recognition and neural networks (1996), Cambridge: Cambridge University Press
[22] Rousseeuw, PJ., Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J Comput Appl Math, 20, 53-65 (1987) · Zbl 0636.62059
[23] Cheng, D.; Zhu, Q.; Huang, J., A novel cluster validity index based on local cores, IEEE Trans Neural Netw Learn Syst, 30, 985-999 (2019)
[24] Nelsen, RB., An introduction to copulas (2006), New York (NY): Springer Science+Business Media
[25] Modak, S.; Bandyopadhyay, U., A new nonparametric test for two sample multivariate location problem with application to astronomy, J Stat Theory Appl, 18, 136-146 (2019)
[26] Vanisma, F.; De Greve, JP., Close binary systems before and after mass transfer, Astrophys Space Sci, 87, 377-401 (1972)
[27] Bandyopadhyay, U.; Modak, S., Bivariate density estimation using normal-gamma kernel with application to astronomy, J Appl Probab Stat, 13, 23-39 (2018)
[28] Modak, S., Distinction of groups of gamma-ray bursts in the BATSE catalog through fuzzy clustering, Astron Comput, 34 (2021)
[29] Hartigan, JA; Wong, MA., A K-means clustering algorithm, Appl Stat, 28, 100-108 (1979) · Zbl 0447.62062
[30] Ester, M, Kriegel, H-P, Sander, J, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon. AAAI Press; 1996. p. 226-231.
[31] Campello, RJGB, Moulavi, D, Sander, J. Density-based clustering based on hierarchical density estimates. Proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2013); 2013. p. 160-172. (Lecture notes in computer science; 7819). Berlin, Heidelberg: Springer.
[32] Norris, JP; Cline, TL; Desai, UD, Frequency of fast, narrow γ-ray bursts, Nature, 308, 434-435 (1984)
[33] Kouveliotou, C.; Meegan, CA; Fishman, GJ, Identification of two classes of gamma-ray bursts, Astrophys J, 413, L101 (1993)
[34] Mukherjee, S.; Feigelson, ED; Babu, GJ, Three types of gamma-ray bursts, Astrophys J, 508, 314-327 (1998)
[35] Tarnopolski, M., On the limit between short and long GRBs, Astrophys Space Sci, 359, 20 (2015)
[36] Schölkopf, B.; Smola, A., Learning with kernels: support vector machines, regularization, optimization, and beyond (2002), Cambridge: MIT Press
[37] Modak, S.; Chattopadhyay, T.; Chattopadhyay, AK., Two phase formation of massive elliptical galaxies: study through cross-correlation including spatial effect, Astrophys Space Sci, 362, 206-215 (2017)
[38] Balastegui, A.; Ruiz-Lapuente, P.; Canal, R., Reclassification of gamma-ray bursts, Mon Not R Astron Soc, 328, 283-290 (2001)
[39] Chattopadhyay, T.; Misra, R.; Chattopadhyay, AK, Statistical evidence for three classes of gamma-ray bursts, Astrophys J, 667, 1017-1023 (2007)
[40] King, A.; Olsson, E.; Davies, MB., A new type of long gamma-ray burst, Mon Not R Astron Soc, 374, L34-L36 (2007)
[41] Veres, P.; Bagoly, Z.; Horváth, I., A distinct peak-flux distribution of the third class of gamma-ray bursts: a possible signature of X-ray flashes?, Astrophys J, 725, 1955-1964 (2010)
[42] Horváth, I.; Tóth, BG; Hakkila, J., Classifying GRB 170817A/GW170817 in a Fermi duration-hardness plane, Astrophys Space Sci, 363, 53 (2018)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.