×

Using projection-based clustering to find distance- and density-based clusters in high-dimensional data. (English) Zbl 07413949

Summary: For high-dimensional datasets in which clusters are formed by both distance and density structures (DDS), many clustering algorithms fail to identify these clusters correctly. This is demonstrated for 32 clustering algorithms using a suite of datasets which deliberately pose complex DDS challenges for clustering. In order to improve the structure finding and clustering in high-dimensional DDS datasets, projection-based clustering (PBC) is introduced. The coexistence of projection and clustering allows to explore DDS through a topographic map. This enables to estimate, first, if any cluster tendency exists and, second, the estimation of the number of clusters. A comparison showed that PBC is always able to find the correct cluster structure, while the performance of the best of the 32 clustering algorithms varies depending on the dataset.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)

References:

[1] Adolfsson, A.; Ackerman, M.; Brownstein, NC, To cluster, or not to cluster: an analysis of clusterability methods, Pattern Recognition, 88, 13-26 (2019) · doi:10.1016/j.patcog.2018.10.026
[2] Aeberhard, S.; Coomans, D.; De Vel, O., Comparison of classifiers in high dimensional settings, technical report 92-02 (1992), North Queensland: James Cook University of North Queensland, Department of Computer Science and Department of Mathematics and Statistics, North Queensland
[3] Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., & Park, J.S. (1999). Fast algorithms for projected clustering. Proc. ACM SIGMOD International Conference on Management of Data (Vol. 28, pp. 61-72) Philadelphia, Pennsylvania: Association for Computing Machinery.
[4] Aggarwal, CC; Yu, PS, Finding generalized projected clusters in high dimensional spaces, Proceedings of the ACM SIGMOD international conference on management of data, 70-81 (2000), New York: ACM, New York · doi:10.1145/342009.335383
[5] Agrawal, R.; Gehrke, J.; Gunopulos, D.; Raghavan, P., Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the ACM SIGMOD international conference on management of data, 94-105 (1998), Seattle: ACM, Seattle
[6] Anderson, E., The Irises of the Gaspé Peninsula, Bulletin of the American Iris Society, 59, 2-5 (1935)
[7] Arabie, P.; Hubert, L.; Bagozzi, RP, Cluster analysis in marketing research, Advanced methods of marketing research, 160-189 (1994), Oxford, England: Blackwell Business, Oxford, England
[8] Arabie, P.; Hubert, LJ; De Soete, G., Clustering and classification (1996), Singapore: World Scientific, Singapore · Zbl 0836.00014 · doi:10.1142/1930
[9] Aupetit, M., Visualizing distortions and recovering topology in continuous projection techniques, Neurocomputing, 70, 1304-1330 (2007) · doi:10.1016/j.neucom.2006.11.018
[10] Bock, HH; Bozdogan, H.; Gupta, AK, On the interface between cluster analysis, principal component analysis, and multidimensional scaling, Multivariate statistical modeling and data analysis, 17-34 (1987), Dordrecht: Springer, Dordrecht · Zbl 0627.62068 · doi:10.1007/978-94-009-3977-6_2
[11] Bonner, RE, On some clustering technique, IBM Journal of Research and Development, 8, 22-32 (1964) · Zbl 0116.09705 · doi:10.1147/rd.81.0022
[12] Chang, WC, On using principal components before separating a mixture of two multivariate normal distributions, Journal of the Royal Statistical Society: Series C: Applied Statistics, 32, 267-275 (1983) · Zbl 0538.62050
[13] Charrad, M., Ghazzali, N., Boiteau, V., & Niknafs, A. (2012). NbClust: An R Package for determining the relevant number of clusters in a data set. Journal of statistical Software, 61(6),1-36. doi:10.18637/jss.v061.i06
[14] Comon, P.; Lacoume, J., Independent component analysis, Higher-order statistics, 29-38 (1992), Amsterdam: Elsevier, Amsterdam
[15] Dasgupta, S.; Gupta, A., An elementary proof of a theorem of Johnson and Lindenstrauss, Random Structures & Algorithms, 22, 60-65 (2003) · Zbl 1018.51010 · doi:10.1002/rsa.10073
[16] De Soete, G.; Carroll, JD; Diday, E.; Lechevallier, Y.; Schader, M.; Bertrand, P.; Burtschy, B., K-means clustering in a low-dimensional Euclidean space, New approaches in classification and data analysis, 212-219 (1994), Berlin: Springer, Berlin · Zbl 1113.00306 · doi:10.1007/978-3-642-51175-2_24
[17] Defays, D., An efficient algorithm for a complete link method, The Computer Journal, 20, 364-366 (1977) · Zbl 0364.68038 · doi:10.1093/comjnl/20.4.364
[18] Demartines, P.; Hérault, J., CCA: “curvilinear component analysis”, 15° Colloque sur le Traitement du Signal et des Images, 921-924 (1995), France: GRETSI, Groupe d’Etudes du Traitement du Signal et des Images, France
[19] Dijkstra, EW, A note on two problems in connexion with graphs, Numerische Mathematik, 1, 269-271 (1959) · Zbl 0092.16002 · doi:10.1007/BF01386390
[20] Dimitriadou, E. (2002). cclust-convex clustering methods and clustering indexes. R Package Version 0.6-21.
[21] Dimitriadou, E.; Dolničar, S.; Weingessel, A., An examination of indexes for determining the number of clusters in binary data sets, Psychometrika, 67, 137-159 (2002) · Zbl 1297.62229 · doi:10.1007/BF02294713
[22] Duda, RO; Hart, PE; Stork, DG, Pattern classification (2001), New York: Wiley, New York · Zbl 0968.68140
[23] Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proc. Second International Conference on Knowledge Discovery and Data Mining (KDD 96) (Vol. 96, pp. 226-231). Portland, Oregon: AAAI Press.
[24] Everitt, BS; Landau, S.; Leese, M., Cluster analysis (2001), London: Arnold, London · Zbl 1205.62076
[25] Everitt, BS; Landau, S.; Leese, M.; Stahl, D.; Everitt, BS; Landau, S.; Leese, M.; Stahl, D., Hierarchical clustering, Cluster analysis, 71-110 (2011), New York: Wiley, New York · Zbl 1274.62003 · doi:10.1002/9780470977811.ch4
[26] Fisher, RA, The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7, 179-188 (1936) · doi:10.1111/j.1469-1809.1936.tb02137.x
[27] Florek, K., Łukaszewicz, J., Perkal, J., Steinhaus, H., & Zubrzycki, S. (1951). Sur la liaison et la division des points d’un ensemble fini. Proc. Colloquium Mathematicae (Vol. 2, pp. 282-285). Institute of Mathematics Polish Academy of Sciences. · Zbl 0045.26103
[28] Flury, B.; Riedwyl, H., Multivariate statistics, a practical approach (1988), London: Chapman and Hall, London · Zbl 0495.62057 · doi:10.1007/978-94-009-1217-5
[29] Fraley, C.; Raftery, AE, Model-based clustering, discriminant analysis, and density estimation, Journal of the American Statistical Association, 97, 611-631 (2002) · Zbl 1073.62545 · doi:10.1198/016214502760047131
[30] Fraley, C., & Raftery, A.E. (2006). MCLUST version 3: an R package for normal mixture modeling and model-based clustering Vol. Technical Report No. 504, Department of Statistics, University of Washington, Seattle.
[31] Franck, P.; Cameron, E.; Good, G.; Rasplus, JY; Oldroyd, BP, Nest architecture and genetic differentiation in a species complex of Australian stingless bees, Molecular Ecology, 13, 2317-2331 (2004) · doi:10.1111/j.1365-294X.2004.02236.x
[32] Frey, BJ; Dueck, D., Clustering by passing messages between data points, Science, 315, 972-976 (2007) · Zbl 1226.94027 · doi:10.1126/science.1136800
[33] Ge, R., Ester, M., Jin, W., & Davidson, I. (2007). Constraint-driven clustering Proc. 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 07) (pp. 320-329). San Jose, California: Association for Computing Machinery.
[34] Haferlach, T.; Kohlmann, A.; Wieczorek, L.; Basso, G.; Te Kronnie, G.; Béné, M-C; De Vos, J.; Hernández, JM; Hofmann, W-K; Mills, KI, Clinical utility of microarray-based gene expression profiling in the diagnosis and subclassification of leukemia: report from the international microarray innovations in leukemia study group, Journal of Clinical Oncology, 28, 2529-2537 (2010) · doi:10.1200/JCO.2009.23.4732
[35] Handl, J.; Knowles, J.; Kell, DB, Computational cluster validation in post-genomic data analysis, Bioinformatics, 21, 3201-3212 (2005) · doi:10.1093/bioinformatics/bti517
[36] Hennig, C.; Spiliopoulou, M.; Schmidt-Thieme, L.; Janning, R., How many bee species? A case study in determining the number of clusters, Data analysis, machine learning and knowledge discovery, 41-49 (2014), Berlin: Springer, Berlin · doi:10.1007/978-3-319-01595-8_5
[37] Hennig, C., Handbook of cluster analysis (2015), New York: Chapman & Hall/CRC, New York · doi:10.1201/b19706
[38] Herrero, J.; Valencia, A.; Dopazo, J., A hierarchical unsupervised growing neural network for clustering gene expression patterns, Bioinformatics, 17, 126-136 (2001) · doi:10.1093/bioinformatics/17.2.126
[39] Heyer, LJ; Kruglyak, S.; Yooseph, S., Exploring expression data: identification and analysis of coexpressed genes, Genome Research, 9, 1106-1115 (1999) · doi:10.1101/gr.9.11.1106
[40] Hinton, GE; Roweis, ST, Stochastic neighbor embedding, Advances in neural information processing systems, 833-840 (2002), Cambridge: MIT Press, Cambridge
[41] HINTZE, JL; NELSON, RD, Violin plots: a box plot-density trace synergism, The American Statistician, 52, 181-184 (1998)
[42] Hofmeyr, D.; Pavlidis, N., Maximum clusterability divisive clustering, 2015 IEEE symposium series on computational intelligence, 780-786 (2015), Piscataway, NJ: IEEE, Piscataway, NJ · doi:10.1109/SSCI.2015.116
[43] Hofmeyr, D.; Pavlidis, N., PPCI: an R package for cluster identification using projection pursuit, The R Journal, 11, 152 (2019) · doi:10.32614/RJ-2019-046
[44] Hofmeyr, DP, Clustering by minimum cut hyperplanes, IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 1547-1560 (2016) · doi:10.1109/TPAMI.2016.2609929
[45] Hotelling, H., Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24, 417-441 (1933) · JFM 59.1182.04 · doi:10.1037/h0071325
[46] Hubert, L.; Arabie, P., Comparing partitions, Journal of Classification, 2, 193-218 (1985) · Zbl 0587.62128 · doi:10.1007/BF01908075
[47] Jain, AK; Dubes, RC, Algorithms for clustering data (1988), Englewood Cliffs: Prentice Hall College Div, Englewood Cliffs · Zbl 0665.62061
[48] Johnson, W. B., & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(1), 189-206. · Zbl 0539.46017
[49] Kaufman, L.; Rousseeuw, PJ; Kaufman, L.; Rousseeuw, PJ, Partitioning around medoids (program PAM), Finding groups in data: An introduction to cluster analysis, 68-125 (1990), Hoboken, NJ: Wiley, Hoboken, NJ · Zbl 1345.62009 · doi:10.1002/9780470316801.ch2
[50] Kaufman, L.; Rousseeuw, PJ, Finding groups in data: an introduction to cluster analysis (2005), Hoboken: Wiley, Hoboken · Zbl 1345.62009
[51] Kim, J., Emergence: core ideas and issues, Synthese, 151, 547-559 (2006) · doi:10.1007/s11229-006-9025-0
[52] Kleinberg, J., An impossibility theorem for clustering, Advances in neural information processing systems, 463-470 (2003), Vancouver, British Columbia: MIT Press, Vancouver, British Columbia
[53] Lance, GN; Williams, WT, Computer programs for hierarchical polythetic classification (“similarity analyses”), The Computer Journal, 9, 60-64 (1966) · Zbl 0136.38807 · doi:10.1093/comjnl/9.1.60
[54] Lance, GN; Williams, WT, A generalized sorting strategy for computer classifications, Nature, 212, 218 (1966) · doi:10.1038/212218a0
[55] Lance, GN; Williams, WT, A general theory of classificatory sorting strategies: 1. Hierarchical systems, The Computer Journal, 9, 373-380 (1967) · doi:10.1093/comjnl/9.4.373
[56] Lichman, M., UCI machine learning repository (2013), Irvine: University of California, School of Information and Computer Science, Irvine
[57] Linde, Y.; Buzo, A.; Gray, R., An algorithm for vector quantizer design, IEEE Transactions on Communications, 28, 84-95 (1980) · doi:10.1109/TCOM.1980.1094577
[58] Lötsch, J.; Ultsch, A., Exploiting the structures of the U-matrix, Advances in self-organizing maps and learning vector quantization, 249-257 (2014), Mittweida: Springer International Publishing, Mittweida · doi:10.1007/978-3-319-07695-9_24
[59] Markos, A.; Iodice D’Enza, A.; van de Velden, M., Beyond tandem analysis: joint dimension reduction and clustering in R, Journal of Statistical Software (Online), 91, 1-24 (2019)
[60] Martinetz, TM; Berkovich, SG; Schulten, KJ, ‘Neural-gas’ network for vector quantization and its application to time-series prediction, IEEE Transactions on Neural Networks, 4, 558-569 (1993) · doi:10.1109/72.238311
[61] McQuitty, LL, Similarity analysis by reciprocal pairs for discrete and continuous data, Educational and Psychological Measurement, 26, 825-831 (1966) · doi:10.1177/001316446602600402
[62] Milligan, GW; Cooper, MC, A study of standardization of variables in cluster analysis, Journal of Classification, 5, 181-204 (1988) · doi:10.1007/BF01897163
[63] Mirkin, BG, Clustering: a data recovery approach (2005), Boca Raton: Chapman & Hall/CRC, Boca Raton · Zbl 1083.68099 · doi:10.1201/9781420034912
[64] Ng, AY; Jordan, MI; Weiss, Y., On spectral clustering: analysis and an algorithm, Advances in Neural Information Processing Systems, 2, 849-856 (2002)
[65] Niu, D., Dy, J., & Jordan, M. (2011). Dimensionality reduction for spectral clustering. In Gordon, G., Dunson, D. & Dudík, M. (eds.), Proc. Fourteenth International Conference on Artificial Intelligence and Statistics (Vol. 15, pp. 552-560). Fort Lauderdale, FL: PMLR.
[66] Patterson, T.; Kelso, NV, Hal Shelton revisited: designing and producing natural-color maps with satellite land cover data, Cartographic Perspectives, 47, 28-55 (2004) · doi:10.14714/CP47.470
[67] Pavlidis, NG; Hofmeyr, DP; Tasoulis, SK, Minimum density hyperplanes, The Journal of Machine Learning Research, 17, 5414-5446 (2016) · Zbl 1392.68362
[68] Pearson, K., LIII. On lines and planes of closest fit to systems of points in space, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2, 559-572 (1901) · JFM 32.0246.07 · doi:10.1080/14786440109462720
[69] R Development Core Team, R: a language and environment for statistical computing (Version 3.2.5) (2008), Vienna: R Foundation for Statistical Computing, Vienna
[70] Ritter, G., Robust cluster analysis and variable selection (2014), Passau: Chapman & Hall/CRC, Passau · Zbl 1341.62037 · doi:10.1201/b17353
[71] Rodriguez, A.; Laio, A., Clustering by fast search and find of density peaks, Science, 344, 1492-1496 (2014) · doi:10.1126/science.1242072
[72] Rousseeuw, PJ; Kaufman, L., Finding groups in data (1990), Brussels: Wiley, Brussels · Zbl 1345.62009
[73] Scharl, T.; Leisch, F., The stochastic QT-clust algorithm: evaluation of stability and variance on time-course microarray data, Proceedings in computational statistics (Compstat), 1015-1022 (2006), Heidelberg: Physica Verlag, Heidelberg
[74] Sokol, RR; Michener, CD, A statistical method for evaluating systematic relationships, Univ Kansas Science Bulletin, 28, 1409-1438 (1958)
[75] Steinley, D.; Brusco, MJ, Initializing k-means batch clustering: a critical evaluation of several techniques, Journal of Classification, 24, 99-121 (2007) · Zbl 1144.62331 · doi:10.1007/s00357-007-0003-0
[76] Steinley, D.; Brusco, MJ; Henson, R., Principal cluster axes: a projection pursuit index for the preservation of cluster structures in the presence of data reduction, Multivariate Behavioral Research, 47, 463-492 (2012) · doi:10.1080/00273171.2012.673952
[77] Theodoridis, S.; Koutroumbas, K., Pattern recognition (2009), Montreal: Elsevier, Montreal · Zbl 0954.68131
[78] Thrun, MC, Projection based clustering through self-organization and swarm intelligence (2018), Heidelberg: Springer, Heidelberg · doi:10.1007/978-3-658-20540-9
[79] Thrun, M.C., Gehlert, T., & Ultsch, A. (2020). Analyzing the fine structure of distributions. Preprint available at arXiv.org, PLOS ONE, in revision arXiv:1908.06081.
[80] Thrun, M.C., Lerch, F., Lötsch, J., & Ultsch, A. (2016). Visualization and 3D printing of multivariate data of biomarkers. In International conference in Central Europe on computer graphics, visualization and computer vision (WSCG) (pp. 7-16). Plzen.
[81] Thrun, MC; Ultsch, A., Clustering benchmark datasets exploiting the fundamental clustering problems, Data in Brief, 30 C, 105501 (2020) · doi:10.1016/j.dib.2020.105501
[82] Thrun, M. C., & Ultsch, A. (2020b). Swarm intelligence for self-organized clustering. Journal of Artificial Intelligence, 103237. doi:10.1016/j.artint.2020.103237. · Zbl 1504.68191
[83] Timmerman, ME; Ceulemans, E.; Kiers, HA; Vichi, M., Factorial and reduced K-means reconsidered, Computational Statistics & Data Analysis, 54, 1858-1871 (2010) · Zbl 1284.62396 · doi:10.1016/j.csda.2010.02.009
[84] Torgerson, WS, Multidimensional scaling: I. Theory and method, Psychometrika, 17, 401-419 (1952) · Zbl 0049.37603 · doi:10.1007/BF02288916
[85] Tukey, J. W. (1977). Exploratory data analysis. Reading: United States Addison-Wesley Publishing Company. · Zbl 0409.62003
[86] Tung, A. K.., Han, J., Lakshmanan, L. V., & Ng, R.T. (2001). Constraint-based clustering in large databases. In Van den Bussche, J. & Vianu, V. (eds.), Proc. International Conference on Database Theory (ICDT) (Vol. 1973, pp. 405-419). Berlin, Heidelberg, London: Springer. · Zbl 1047.68594
[87] Ultsch, A. (1995). Self organizing neural networks perform different from statistical k-means clustering. Proc. society for information and classification (GFKL) (Vol. 1995). Basel.
[88] Ultsch, A. (2005a). Clustering wih SOM: U*C. In Proceedings of the 5th workshop on self-organizing maps (pp. 75-82), Paris, France.
[89] Ultsch, A.; BAIER, D.; Werrnecke, KD, Pareto density estimation: A density estimation for knowledge discovery, Innovations in classification, data science, and information systems, 91-100 (2005), Berlin, Germany: Springer, Berlin, Germany · Zbl 1448.62105 · doi:10.1007/3-540-26981-9_12
[90] Ultsch, A., Emergence in self-organizing feature maps, 6th workshop on self-organizing maps (WSOM 07), 1-7 (2007), Bielefeld, Germany: University Library of Bielefeld, Bielefeld, Germany
[91] Ultsch, A.; Behnisch, M.; Lötsch, J.; Merényi, E.; Mendenhall, JM; O’Driscoll, P., ESOM visualizations for quality assessment in clustering, Advances in self-organizing maps and learning vector quantization: proceedings of the 11th international workshop WSOM 2016, Houston, Texas, USA, January 6-8, 2016, 39-48 (2016), Cham: Springer International Publishing, Cham · Zbl 1330.68031 · doi:10.1007/978-3-319-28518-4_3
[92] Ultsch, A., & Herrmann, L. (2005). The architecture of emergent self-organizing maps to reduce projection errors. In Verleysen, M. (Ed.), Proc. European Symposium on Artificial Neural Networks (ESANN) (pp. 1-6). Belgium: Bruges.
[93] Ultsch, A.; Lötsch, J., Machine-learned cluster identification in high-dimensional data, Journal of Biomedical Informatics, 66, 95-104 (2017) · doi:10.1016/j.jbi.2016.12.011
[94] Ultsch, A.; Thrun, MC, Credible visualizations for planar projections, 12th international workshop on self-organizing maps and learning vector quantization, clustering and data visualization (WSOM), 1-5 (2017), Nany: IEEE, Nany
[95] Ultsch, A., & Vetter, C. (1995). Self organizing neural networks perform different from statistical k-means clustering Proc. Society for Information and Classification (GFKL) (Vol. 1995) Basel 8th-10th.
[96] van der Maaten, LJP; Postma, EO; van den Herik, HJ, Dimensionality reduction: A comparative review, Journal of Machine Learning Research, 10, 66-71 (2009)
[97] Van Dongen, S.M. (2000). Graph clustering by flow simulation. Utrecht, Netherlands: Ph.D. thesis University of Utrecht.
[98] Venna, J.; Peltonen, J.; Nybo, K.; Aidos, H.; Kaski, S., Information retrieval perspective to nonlinear dimensionality reduction for data visualization, The Journal of Machine Learning Research, 11, 451-490 (2010) · Zbl 1242.62006
[99] Vichi, M.; Kiers, HAL, Factorial k-means analysis for two-way data, Computational Statistics & Data Analysis, 37, 49-64 (2001) · Zbl 1051.62056 · doi:10.1016/S0167-9473(00)00064-5
[100] Ward, JH Jr, Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association, 58, 236-244 (1963) · doi:10.1080/01621459.1963.10500845
[101] Wehrens, R.; Buydens, LMC, Self-and super-organizing maps in R: the Kohonen package, Journal of Statistical Software, 21, 1-19 (2007) · doi:10.18637/jss.v021.i05
[102] Weinstein, JN; Collisson, EA; Mills, GB; Shaw, KRM; Ozenberger, BA; Ellrott, K.; Shmulevich, I.; Sander, C.; Stuart, JM; Cancer Genome Atlas Research Network, The cancer genome atlas pan-cancer analysis project, Nature Genetics, 45, 1113-1120 (2013) · doi:10.1038/ng.2764
[103] Wickham, H., & Stryjewski, L. (2011). 40 years of boxplots. The American Statistician.
[104] Wolberg, WH; Mangasarian, OL, Multisurface method of pattern separation for medical diagnosis applied to breast cytology, Proceedings of the National Academy of Sciences of the United States of America, 87, 9193-9196 (1990) · Zbl 0709.92537 · doi:10.1073/pnas.87.23.9193
[105] Zhang, B., Dependence of clustering algorithm performance on clustered-ness of data, technical report HPL-2000-137 (2001), Palo Alto: Hewlett-Packard Labs, Palo Alto
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.