
MDCGen: multidimensional dataset generator for clustering. (English) Zbl 1436.62262

Summary: We present a tool for generating multidimensional synthetic datasets for testing, evaluating, and benchmarking unsupervised classification algorithms. Our proposal fills a gap observed in previous approaches with regard to the underlying distributions used to create multidimensional clusters. As a novelty, normal and non-normal distributions can be combined either to define values independently feature by feature (i.e., multivariate distributions) or to establish overall intra-cluster distances. Being highly flexible, parameterizable, and randomizable, MDCGen also implements commonly sought features: (a) customization of cluster separation, (b) overlap control, (c) addition of outliers and noise, (d) definition of correlated variables and rotations, (e) flexibility for allowing or avoiding isolation constraints per dimension, (f) creation of subspace clusters and subspace outliers, (g) import of arbitrary distributions for value generation, and (h) dataset quality evaluations, among others. As a result, the proposed tool offers a wider range of potential datasets for more comprehensive testing of clustering algorithms.
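To make the idea of combining distributions feature by feature more concrete, the following is a minimal Python sketch using NumPy. It is not the MDCGen API: the function name `make_clusters`, its parameters, the alternation rule between normal and uniform features, and the outlier bounding box are all assumptions chosen for illustration. It covers only per-feature value generation and the addition of outliers, not overlap control, rotations, or subspace clusters.

```python
import numpy as np

def make_clusters(n_points=300, n_clusters=3, n_dims=4, n_outliers=10,
                  spread=0.05, seed=0):
    """Hypothetical helper (not part of MDCGen): labeled synthetic clusters."""
    rng = np.random.default_rng(seed)
    # Assumption: centroids are drawn uniformly in the unit hypercube.
    centers = rng.uniform(0.0, 1.0, size=(n_clusters, n_dims))
    # Split the points as evenly as possible among the clusters.
    sizes = np.full(n_clusters, n_points // n_clusters)
    sizes[: n_points % n_clusters] += 1

    blocks, labels = [], []
    for k, (center, size) in enumerate(zip(centers, sizes)):
        cols = []
        for d in range(n_dims):
            # Assumption: alternate normal and uniform features so that each
            # cluster mixes normal and non-normal distributions.
            if (k + d) % 2 == 0:
                col = rng.normal(center[d], spread, size)
            else:
                col = rng.uniform(center[d] - spread, center[d] + spread, size)
            cols.append(col)
        blocks.append(np.column_stack(cols))
        labels.append(np.full(size, k))

    # Outliers: uniform over an enlarged bounding box, labeled -1.
    outliers = rng.uniform(-0.5, 1.5, size=(n_outliers, n_dims))
    X = np.vstack(blocks + [outliers])
    y = np.concatenate(labels + [np.full(n_outliers, -1)])
    return X, y

X, y = make_clusters()
print(X.shape, np.unique(y))
```

The returned labels `y` serve as ground truth for a clustering algorithm run on `X`, with `-1` marking the outliers. Fixing `seed` keeps the dataset reproducible across runs.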

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62R07 Statistical aspects of big data and data science
62-08 Computational methods for problems pertaining to statistics
68T05 Learning and adaptive systems in artificial intelligence

Software:

MDCGen; Silhouettes
