
SpectralCAT: categorical spectral clustering of numerical and nominal data. (English) Zbl 1225.68171

Summary: Data clustering is a common technique for data analysis that is used in many fields, including machine learning, data mining, customer segmentation, trend analysis, pattern recognition and image analysis. Although many clustering algorithms have been proposed, most of them handle only one data type (numerical or nominal) or mixed data types (numerical and nominal), and only a few provide a generic method that clusters all types of data. Most real-world applications, however, require handling both feature types and their mix. In this paper, we propose an automated technique, called SpectralCAT, for unsupervised clustering of high-dimensional data whose attributes are numerical, nominal, or a mix of both. We suggest automatically transforming the high-dimensional input data into categorical values. This is done by discovering, for each feature and attribute in the dataset, the optimal transformation according to the Calinski-Harabasz index. Then, a method for spectral clustering via dimensionality reduction is applied to the transformed data. This is achieved by automatic non-linear transformations, which identify geometric patterns in the data and find the connections among them while projecting them onto low-dimensional spaces. We compare our method to several clustering algorithms on 16 public datasets from different domains and of different types. The experiments demonstrate that our method outperforms these algorithms in most cases.
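
To make the two stages concrete, the sketch below (Python with NumPy and scikit-learn; the helper names categorize_feature and spectral_embed are hypothetical) discretizes each numerical feature with the number of categories that maximizes the Calinski-Harabasz index, then embeds the categorized rows via the leading eigenvectors of a row-normalized Gaussian affinity built on Hamming distances. This is a simplified stand-in for the paper's diffusion-map-based spectral stage [37], not the authors' implementation.

```python
# Minimal sketch of the two stages described in the summary; illustrative
# only, not the authors' implementation. Helper names are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def categorize_feature(x, max_categories=8):
    """Discretize one numerical feature, choosing the number of categories
    that maximizes the Calinski-Harabasz index (stage 1)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    best_score, best_labels = -np.inf, None
    for k in range(2, max_categories + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
        score = calinski_harabasz_score(x, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels  # categorical codes for this feature

def spectral_embed(categorical, n_components=2, epsilon=0.5):
    """Embed categorized rows with the leading non-trivial eigenvectors of a
    row-stochastic Gaussian affinity on Hamming distances (stage 2); a
    simplified stand-in for the diffusion-map construction of [37]."""
    # pairwise Hamming distance between categorized rows
    dist = (categorical[:, None, :] != categorical[None, :, :]).mean(axis=2)
    W = np.exp(-dist ** 2 / epsilon)              # Gaussian kernel affinity
    P = W / W.sum(axis=1, keepdims=True)          # row-stochastic normalization
    eigvals, eigvecs = np.linalg.eig(P)
    order = np.argsort(-eigvals.real)
    # skip the trivial constant eigenvector (eigenvalue 1)
    return eigvecs[:, order[1:n_components + 1]].real

# toy data: two well-separated numerical clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(5, 1, (50, 3))])
cat = np.column_stack([categorize_feature(X[:, j]) for j in range(X.shape[1])])
embedding = spectral_embed(cat)
print(embedding.shape)  # (100, 2): data projected onto a low-dimensional space
```

A standard algorithm such as \(k\)-means can then be run on the embedded coordinates to obtain the final clusters.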

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68T10 Pattern recognition, speech recognition

References:

[1] Huang, Z. X., Extensions to the \(k\)-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, 2, 3, 283-304 (1998)
[2] MacQueen, J., Some methods for classification and analysis of multivariate observations, (Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1 (1967), University of California Press: University of California Press Berkeley, CA), 281-297 · Zbl 0214.46201
[3] Huang, Z., Extensions to the \(k\)-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, 2, 283-304 (1998)
[4] Calinski, T.; Harabasz, J., A dendrite method for cluster analysis, Communications in Statistics, 3, 1-27 (1974) · Zbl 0273.62010
[5] Gan, G.; Ma, C.; Wu, J., Data Clustering: Theory, Algorithms, and Applications (2007), SIAM
[6] D. Arthur, S. Vassilvitskii, \(k\)-means++: the advantages of careful seeding, in: Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007, pp. 1027-1035. · Zbl 1302.68273
[7] C. Ding, T. Li, Adaptive dimension reduction using discriminant analysis and \(k\)-means clustering, in: Proceedings of the 24th International Conference on Machine Learning (ICML), 2007.
[8] G. David, A. Averbuch, R. R. Coifman, Hierarchical clustering via localized diffusion folders, in: Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI) Fall Symposium Series 2010, 〈http://aaai.org/ocs/index.php/FSS/FSS10/paper/view/2210〉.
[9] Zhang, T.; Ramakrishnan, R.; Livny, M., BIRCH: an efficient data clustering method for very large databases, SIGMOD Record, 25, 103-114 (1996)
[10] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, in: ACM SIGMOD International Conference on Management of Data, Seattle, WA, 1998, pp. 73-84.
[11] Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X., A density-based algorithm for discovering clusters in large spatial databases with noise, (International Conference on Knowledge Discovery in Databases and Data Mining (KDD-96) (1996), AAAI Press: AAAI Press Portland, Oregon), 226-231
[12] Kaufman, L.; Rousseeuw, P., Finding Groups in Data: An Introduction to Cluster Analysis (1990), John Wiley & Sons, Inc.: John Wiley & Sons, Inc. New York · Zbl 1345.62009
[13] Ganti, V.; Gehrke, J.; Ramakrishnan, R., CACTUS: clustering categorical data using summaries, (Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (1999), ACM Press: ACM Press San Diego, CA, USA), 73-83
[14] Barbará, D.; Couto, J.; Li, Y., COOLCAT: an entropy-based algorithm for categorical clustering, (Proceedings of the 11th ACM Conference on Information and Knowledge Management (CIKM 02) (2002), ACM Press: ACM Press McLean, Virginia, USA), 582-589
[15] Guha, S.; Rastogi, R.; Shim, K., ROCK: a robust clustering algorithm for categorical attributes, Information Systems, 25, 345-366 (2000)
[16] Gibson, D.; Kleinberg, J.; Raghavan, P., Clustering categorical data: an approach based on dynamical systems, (Proceedings of the 24th International Conference on Very Large Data Bases (VLDB) (1998), Morgan Kaufmann: Morgan Kaufmann New York, NY, USA), 311-322
[17] Ben-Hur, A.; Horn, D.; Siegelmann, H. T.; Vapnik, V., Support vector clustering, Journal of Machine Learning Research, 2, 125-137 (2001) · Zbl 1002.68598
[18] Girolami, M., Mercer kernel based clustering in feature space, IEEE Transactions on Neural Networks, 13, 780-784 (2002)
[19] R. Zhang, A. Rudnicky, A large scale clustering scheme for kernel \(k\)-means, in: Proceedings of the 16th International Conference on Pattern Recognition (ICPR), 2002.
[20] Couto, J., Kernel \(k\)-means for categorical data, Advances in Intelligent Data Analysis VI, 3646, 46-56 (2005) · Zbl 1141.68559
[21] von Luxburg, U., A tutorial on spectral clustering, Statistics and Computing, 17, 4, 395-416 (2007)
[22] Bach, F. R.; Jordan, M. I., Learning spectral clustering, (Proceedings of NIPS, vol. 16 (2004), MIT Press), 305-312
[23] R. Kannan, S. Vempala, A. Vetta, On clusterings: good, bad and spectral, in: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, 2000. · Zbl 1192.05160
[24] Shi, J.; Malik, J., Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888-905 (2000)
[25] Ng, A. Y.; Jordan, M. I.; Weiss, Y., On spectral clustering: analysis and an algorithm, (Advances in Neural Information Processing Systems (NIPS), vol. 14 (2002))
[26] Fukunaga, K.; Hostetler, L., The estimation of the gradient of a density function, with applications in pattern recognition, IEEE Transactions on Information Theory, 21, 1, 32-40 (1975) · Zbl 0297.62025
[27] Comaniciu, D.; Meer, P., Mean shift: a robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 5, 603-619 (2002)
[28] Cheng, Y., Mean shift, mode seeking, and clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 8, 790-799 (1995)
[29] Comaniciu, D., An algorithm for data-driven bandwidth selection, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 2, 281-288 (2003)
[30] Gower, J. C., A general coefficient of similarity and some of its properties, Biometrics, 27, 857-871 (1971)
[31] Dougherty, J.; Kohavi, R.; Sahami, M., Supervised and unsupervised discretization of continuous features, (Machine Learning: Proceedings of the Twelfth International Conference (1995), Morgan Kaufmann Publishers)
[32] Catlett, J., On changing continuous attributes into ordered discrete attributes, (Proceedings of the European Working Session on Learning (1991), Springer-Verlag: Springer-Verlag Berlin, Germany), 164-178
[33] R.S. Michalski, A planar geometric model for representing multidimensional discrete spaces and multiple-valued logic functions, Technical Report, University of Illinois at Urbana-Champaign, 1978.
[34] Han, J.; Kamber, M., Data Mining: Concepts and Techniques (2001), Morgan Kaufmann: Morgan Kaufmann San Francisco
[35] Oppenheim, A. V.; Schafer, R. W., Discrete-Time Signal Processing (1989), Prentice-Hall · Zbl 0676.42001
[36] C. Blake, C. Merz, UCI repository of machine learning databases, University of California, Department of Information and Computer Science, Irvine, CA, USA, 1998, 〈http://www.ics.uci.edu/mlearn/MLRepository.html〉.
[37] Coifman, R. R.; Lafon, S., Diffusion maps, Applied and Computational Harmonic Analysis, 21, 5-30 (2006) · Zbl 1095.68094
[38] Coifman, R. R.; Lafon, S., Geometric harmonics: a novel tool for multiscale out-of-sample extension of empirical functions, Applied and Computational Harmonic Analysis, 21, 31-52 (2006) · Zbl 1095.68095
[39] F.R.K. Chung, Spectral Graph Theory, CBMS Regional Conference Series in Mathematics, vol. 92, American Mathematical Society, 1997. · Zbl 0867.05046
[40] N. Benoudjit, C. Archambeau, A. Lendasse, J. Lee, M. Verleysen, Width optimization of the Gaussian kernels in radial basis function networks, in: Proceedings of the European Symposium on Artificial Neural Networks, ESANN 2002, Bruges, Belgium, 2002, pp. 425-432.
[41] Benoudjit, N.; Verleysen, M., On the kernel widths in radial-basis function networks, Neural Processing Letters, 18, 139-154 (2003)
[42] Sanchez, V. D., On the number and the distribution of RBF centers, Neurocomputing, 7, 197-202 (1995)
[43] Chen, S.; Billings, S., Neural networks for nonlinear dynamic system modelling and identification, International Journal of Control, 56, 319-346 (1992) · Zbl 0764.93021
[44] M. Orr, Introduction to radial basis function networks, 1996, 〈http://www.anc.ed.ac.uk/rbf/papers/intro.ps〉.
[45] Park, J.; Sandberg, I., Universal approximation using radial basis function networks, Neural Computation, 3, 246-257 (1991)
[46] Haykin, S., Neural Networks: a Comprehensive Foundation (1998), Prentice Hall · Zbl 0828.68103
[47] M. Verleysen, K. Hlavackova, Learning in RBF networks, in: International Conference on Neural Networks (ICNN), Washington, DC, 1996, pp. 199-204.
[48] Saha, A.; Keeler, J., Algorithms for better representation and faster learning in radial basis function networks, (Advances in Neural Information Processing Systems, vol. 2 (1990), Morgan Kaufmann Publishers Inc.: Morgan Kaufmann Publishers Inc. San Francisco, CA)
[49] Moody, J.; Darken, C., Fast learning in networks of locally-tuned processing units, Neural Computation, 1, 281-294 (1989)
[50] Pearson, K., On lines and planes of closest fit to systems of points in space, Philosophical Magazine, 2, 559-572 (1901) · JFM 32.0246.07
[51] Tan, P.-N.; Steinbach, M.; Kumar, V., Introduction to Data Mining (2006), Pearson Addison-Wesley