Fuzzy clustering of mixed data. (English) Zbl 1456.62120

Summary: A fuzzy clustering model for data with mixed features is proposed. The clustering model allows different types of variables, or attributes, to be taken into account. This result is achieved by combining the dissimilarity measures for each attribute by means of a weighting scheme, so as to obtain a distance measure for multiple attributes. The weights are objectively computed during the optimization process and reflect the relevance of each attribute type in the clustering results. Two simulation studies and two empirical applications show the effectiveness of the proposed clustering algorithm in finding clusters that would otherwise remain hidden if a multi-attribute approach were not pursued.
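
The weighting idea in the summary can be illustrated with a minimal sketch. It is not the authors' model: the summary does not give the objective function, so the combination rule (a sum of squared weights times squared per-type distances), the fixed weights, the toy data, and the function names `combined_sq_distance` and `fuzzy_memberships` are all assumptions made for illustration. In the paper the weights are estimated during optimization; here they are simply fixed, and the medoids are chosen by hand.

```python
import numpy as np

def combined_sq_distance(dist_by_type, weights):
    """Combine per-attribute-type distances (each n x c) into one squared distance.

    Assumed form: sum over types of (weight^2 * distance^2).
    """
    weights = np.asarray(weights, dtype=float)
    return sum((w ** 2) * (d ** 2) for w, d in zip(weights, dist_by_type))

def fuzzy_memberships(sq_dist, m=2.0, eps=1e-12):
    """Standard fuzzy c-means style membership update from squared distances."""
    d = np.maximum(sq_dist, eps)          # avoid division by zero at a medoid
    inv = d ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Toy data (hypothetical): 4 objects, one numeric and one categorical attribute.
numeric = np.array([[0.0], [0.1], [0.9], [1.0]])
categorical = np.array(["a", "a", "b", "b"])
medoids = [0, 3]                          # objects used as cluster prototypes

# Distance of every object to every medoid, one matrix per attribute type.
d_num = np.abs(numeric - numeric[medoids].T)                                  # absolute difference
d_cat = (categorical[:, None] != categorical[medoids][None, :]).astype(float)  # simple matching

weights = [0.5, 0.5]                      # fixed here; assumed to sum to 1
sq = combined_sq_distance([d_num, d_cat], weights)
u = fuzzy_memberships(sq)                 # n x c membership matrix, rows sum to 1
print(np.round(u, 3))
```

Raising the weight of one attribute type makes that type dominate the combined distance, which is the sense in which the weights express the relevance of each attribute type.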

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H86 Multivariate analysis and fuzziness

References:

[1] Everitt, B.; Landau, S.; Leese, M.; Stahl, D., Cluster Analysis (2011), John Wiley & Sons, Ltd: London · Zbl 1274.62003
[2] D’Urso, P., Fuzzy clustering, (Hennig, C.; Meila, M.; Murtagh, F.; Rocci, R., Handbook of Cluster Analysis (2015), Chapman and Hall), 545-573 · Zbl 1396.62161
[3] Caiado, J.; Maharaj, E.; D’Urso, P., Time series clustering, (Hennig, C.; Meila, M.; Murtagh, F.; Rocci, R., Handbook of Cluster Analysis (2015), Chapman and Hall), 241-263 · Zbl 1396.62196
[4] Huang, Z., Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Min. Knowl. Discov., 2, 3, 283-304 (1998)
[5] Huang, Z.; Ng, M. K., A fuzzy k-modes algorithm for clustering categorical data, IEEE Trans. Fuzzy Syst., 7, 4, 446-452 (1999)
[6] Ng, M. K.; Li, M. J.; Huang, J. Z.; He, Z., On the impact of dissimilarity measure in k-modes clustering algorithm, IEEE Trans. Pattern Anal. Mach. Intell., 29, 3, 503-507 (2007)
[7] Cao, F.; Liang, J.; Li, D.; Bai, L.; Dang, C., A dissimilarity measure for the k-modes clustering algorithm, Knowl.-Based Syst., 26, 120-127 (2012)
[8] Maharaj, E. A.; D’Urso, P., Fuzzy clustering of time series in the frequency domain, Inf. Sci., 181, 7, 1187-1211 (2011) · Zbl 1215.62061
[9] D’Urso, P.; Di Lallo, D.; Maharaj, E. A., Autoregressive model-based fuzzy clustering and its application for detecting information redundancy in air pollution monitoring networks, Soft Comput., 17, 1, 83-131 (2013)
[10] D’Urso, P.; De Giovanni, L.; Massari, R., Robust fuzzy clustering of multivariate time trajectories, Int. J. Approx. Reason., 99, 12-38 (2018) · Zbl 1453.62540
[11] Kim, D.-W.; Lee, K. H.; Lee, D., Fuzzy clustering of categorical data using fuzzy centroids, Pattern Recognit. Lett., 25, 11, 1263-1271 (2004)
[12] Bai, L.; Liang, J.; Dang, C., An initialization method to simultaneously find initial cluster centers and the number of clusters for clustering categorical data, Knowl.-Based Syst., 24, 6, 785-795 (2011)
[13] D’Urso, P.; Massari, R., Fuzzy clustering of human activity patterns, Fuzzy Sets Syst., 215, 29-54 (2013)
[14] Pham, D. L., Spatial models for fuzzy clustering, Comput. Vis. Image Underst., 84, 2, 285-297 (2001) · Zbl 1033.68612
[15] Disegna, M.; D’Urso, P.; Durante, F., Copula-based fuzzy clustering of spatial time series, Spat. Stat., 21, 209-225 (2017)
[16] D’Urso, P.; De Giovanni, L.; Disegna, M.; Massari, R., Fuzzy clustering with spatial-temporal information, Spat. Stat., 30, 71-102 (2019)
[17] De Carvalho, F. d. A.; Tenório, C. P., Fuzzy K-means clustering algorithms for interval-valued data based on adaptive quadratic distances, Fuzzy Sets Syst., 161, 23, 2978-2999 (2010) · Zbl 1204.62106
[18] D’Urso, P.; Leski, J. M., Fuzzy c-ordered medoids clustering for interval-valued data, Pattern Recognit., 58, 49-67 (2016)
[19] D’Urso, P.; Massari, R.; De Giovanni, L.; Cappelli, C., Exponential distance-based fuzzy clustering for interval-valued data, Fuzzy Optim. Decis. Mak., 16, 1, 51-70 (2017) · Zbl 1428.62306
[20] Coppi, R.; D’Urso, P.; Giordani, P., Fuzzy and possibilistic clustering for fuzzy data, Comput. Stat. Data Anal., 56, 4, 915-927 (2012) · Zbl 1243.62089
[21] D’Urso, P.; De Giovanni, L., Robust clustering of imprecise data, Chemom. Intell. Lab. Syst., 136, 58-80 (2014)
[22] Deng, J.; Hu, J.; Chi, H.; Wu, J., An improved fuzzy clustering method for text mining, 2010 Second International Conference on Networks Security, Wireless Communications and Trusted Computing, 65-69 (2010), IEEE
[23] Nguyen, X.; Gelfand, A. E., The Dirichlet labeling process for clustering functional data, Statistica Sinica, 1249-1289 (2011) · Zbl 1223.62104
[24] Kesemen, O.; Tezel, Ö.; Özkul, E., Fuzzy c-means clustering algorithm for directional data (FCM4DD), Expert Syst. Appl., 58, 76-82 (2016)
[25] Liu, J., Detecting the fuzzy clusters of complex networks, Pattern Recognit., 43, 4, 1334-1345 (2010) · Zbl 1192.68589
[26] Hsu, C.-C.; Lin, S.-H.; Tai, W.-S., Apply extended self-organizing map to cluster and classify mixed-type data, Neurocomputing, 74, 18, 3832-3842 (2011)
[27] Guha, S.; Rastogi, R.; Shim, K., ROCK: a robust clustering algorithm for categorical attributes, Proceedings of the 15th International Conference on Data Engineering, 512-521 (1999), IEEE
[28] Dougherty, J.; Kohavi, R.; Sahami, M., Supervised and unsupervised discretization of continuous features, Machine Learning Proceedings 1995, 194-202 (1995), Elsevier
[29] Ichino, M.; Yaguchi, H., Generalized Minkowski metrics for mixed feature-type data analysis, IEEE Trans. Syst. Man Cybern., 24, 4, 698-708 (1994) · Zbl 1371.68235
[30] Foss, A.; Markatou, M.; Ray, B.; Heching, A., A semiparametric method for clustering mixed data, Mach. Learn., 105, 3, 419-458 (2016) · Zbl 1432.62182
[31] Gower, J. C., A general coefficient of similarity and some of its properties, Biometrics, 857-871 (1971)
[32] Huang, Z., Clustering large data sets with mixed numeric and categorical values, Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 21-34 (1997)
[33] Ahmad, A.; Dey, L., A k-mean clustering algorithm for mixed numeric and categorical data, Data Knowl. Eng., 63, 2, 503-527 (2007)
[34] Liang, J.; Zhao, X.; Li, D.; Cao, F.; Dang, C., Determining the number of clusters using information entropy for mixed data, Pattern Recognit., 45, 6, 2251-2265 (2012) · Zbl 1234.68343
[35] Ji, J.; Bai, T.; Zhou, C.; Ma, C.; Wang, Z., An improved k-prototypes clustering algorithm for mixed numeric and categorical data, Neurocomputing, 120, 590-596 (2013)
[36] Ji, J.; Pang, W.; Zhou, C.; Han, X.; Wang, Z., A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data, Knowl.-Based Syst., 30, 129-135 (2012)
[37] Lu, Y.; Lu, S.; Fotouhi, F.; Deng, Y.; Brown, S. J., FGKA: a fast genetic k-means clustering algorithm, Proceedings of the 2004 ACM symposium on Applied computing, 622-623 (2004), ACM
[38] Roy, D. K.; Sharma, L. K., Genetic k-means clustering algorithm for mixed numeric and categorical data sets, Int. J. Artif. Intell. Appl., 1, 2, 23-28 (2010)
[39] Yang, M.; Hwang, P.; Chen, D., Fuzzy clustering algorithms for mixed feature variables, Fuzzy Sets Syst., 141, 2, 301-317 (2004) · Zbl 1137.62350
[40] El-Sonbaty, Y.; Ismail, M. A., Fuzzy clustering for symbolic data, IEEE Trans. Fuzzy Syst., 6, 2, 195-204 (1998)
[41] Hathaway, R. J.; Bezdek, J. C.; Pedrycz, W., A parametric model for fusing heterogeneous fuzzy data, IEEE Trans. Fuzzy Syst., 4, 3, 270-281 (1996)
[42] Everitt, B. S., A finite mixture model for the clustering of mixed-mode data, Stat. Probab. Lett., 6, 5, 305-309 (1988)
[43] Fisher, D. H., Knowledge acquisition via incremental conceptual clustering, Mach. Learn., 2, 2, 139-172 (1987)
[44] McKusick, K.; Thompson, K., Cobweb/3: A Portable Implementation, Technical Report (1990), NASA Ames Research Center
[45] Ralambondrainy, H., A conceptual version of the K-means algorithm, Pattern Recognit. Lett., 16, 11, 1147-1157 (1995)
[46] Li, C.; Biswas, G., Unsupervised learning with mixed numeric and nominal data, IEEE Trans. Knowl. Data Eng., 14, 4, 673-690 (2002)
[47] Antoni, L.; Krajči, S.; Krídlo, O.; Macek, B.; Pisková, L., On heterogeneous formal contexts, Fuzzy Sets Syst., 234, 22-33 (2014) · Zbl 1315.68232
[48] Lee, M.; Pedrycz, W., The fuzzy c-means algorithm with fuzzy p-mode prototypes for clustering objects having mixed features, Fuzzy Sets Syst., 160, 24, 3590-3600 (2009) · Zbl 1185.68601
[49] Hunt, L.; Jorgensen, M., Clustering mixed data, Wiley Interdiscip. Rev., 1, 4, 352-361 (2011)
[50] Hsu, C.-C.; Lin, S.-H., Visualized analysis of mixed numeric and categorical data via extended self-organizing map, IEEE Trans. Neural Netw. Learn. Syst., 23, 1, 72-86 (2012)
[51] Hennig, C.; Liao, T. F., How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification, J. R. Stat. Soc. Ser. C (Appl. Stat.), 62, 3, 309-369 (2013)
[52] Akay, Ö.; Yüksel, G., Clustering the mixed panel dataset using Gower’s distance and k-prototypes algorithms, Commun. Stat. Simul. Comput., 1-11 (2017)
[53] Kaufman, L.; Rousseeuw, P., Finding Groups in Data: An Introduction to Cluster Analysis (2005), Wiley-Blackwell
[54] Gordon, A. D., Classification, (Chapman & Hall/CRC Monographs on Statistics & Applied Probability) (1999), Chapman and Hall/CRC
[55] Fu, K.; Albus, J., Syntactic Pattern Recognition (1977), Springer-Verlag · Zbl 0356.68096
[56] Krishnapuram, R.; Joshi, A.; Nasraoui, O.; Yi, L., Low-complexity fuzzy relational clustering algorithms for web mining, IEEE Trans. Fuzzy Syst., 9, 4, 595-607 (2001)
[57] D’Urso, P.; Maharaj, E., Autocorrelation-based fuzzy clustering of time series, Fuzzy Sets Syst., 160, 24, 3565-3589 (2009)
[58] Corduas, M.; Piccolo, D., Time series clustering and classification by the autoregressive metric, Comput. Stat. Data Anal., 52, 4, 1860-1872 (2008) · Zbl 1452.62624
[59] Maharaj, E. A.; D’Urso, P.; Galagedera, D. U., Wavelet-based fuzzy clustering of time series, J. Classif., 27, 2, 231-275 (2010) · Zbl 1337.62307
[60] Berndt, D. J.; Clifford, J., Using dynamic time warping to find patterns in time series, Proceedings of the AAAI-94 Workshop Knowledge Discovery in Databases, 359-370 (1994)
[61] Sokal, R. R., A statistical method for evaluating systematic relationships, Univ. Kansas Sci. Bull., 28, 1409-1438 (1958)
[62] Eskin, E.; Arnold, A.; Prerau, M.; Portnoy, L.; Stolfo, S., A geometric framework for unsupervised anomaly detection, Applications of Data Mining in Computer Security, 77-101 (2002), Springer
[63] Karney, C. F., Algorithms for geodesics, J. Geod., 87, 1, 43-55 (2013)
[64] Hamming, R., Error detecting and error correcting codes, Bell Syst. Tech. J., 29, 2, 147-160 (1950) · Zbl 1402.94084
[65] Levenshtein, V., Binary codes capable of correcting deletions, insertions and reversals, Soviet Phys. Dokl., 10, 707-710 (1966)
[66] Kruskal, J., An overview of sequence comparison, (Sankoff, D.; Kruskal, J., Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison (1983), Addison-Wesley Publishing Company: Reading, MA), 1-44
[67] Yang, M.; Ko, C., On a class of fuzzy \(c\)-numbers clustering procedures for fuzzy data, Fuzzy Sets Syst., 84, 1, 49-60 (1996) · Zbl 0906.68136
[68] D’Urso, P.; Giordani, P., A weighted fuzzy c-means clustering model for fuzzy data, Comput. Stat. Data Anal., 50, 6, 1496-1523 (2006) · Zbl 1445.62157
[69] D’Urso, P.; Giordani, P., A least squares approach to principal component analysis for interval valued data, Chemom. Intell. Lab. Syst., 70, 2, 179-192 (2004)
[70] Gowda, K. C.; Diday, E., Symbolic clustering using a new dissimilarity measure, Pattern Recognit., 24, 6, 567-578 (1991)
[71] Yeung, D. S.; Wang, X., Improving performance of similarity-based clustering by feature weight learning, IEEE Trans. Pattern Anal. Mach. Intell., 24, 4, 556-561 (2002)
[72] Xie, X. L.; Beni, G., A validity measure for fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell., 13, 8, 841-847 (1991)
[73] Campello, R. J.; Hruschka, E. R., A fuzzy extension of the silhouette width criterion for cluster analysis, Fuzzy Sets Syst., 157, 21, 2858-2875 (2006) · Zbl 1103.68674
[74] Hüllermeier, E.; Rifqi, M.; Henzgen, S.; Senge, R., Comparing fuzzy partitions: a generalization of the Rand index and related measures, IEEE Trans. Fuzzy Syst., 20, 3, 546-556 (2012)
[75] Hubert, L.; Arabie, P., Comparing partitions, J. Classif., 2, 1, 193-218 (1985)
[76] Hair, J. F.; Anderson, R. E.; Tatham, R. L.; Black, W. C., Multivariate data analysis (1998), Upper Saddle River
[77] D’Urso, P.; De Giovanni, L.; Disegna, M.; Massari, R., Bagged clustering and its application to tourism market segmentation, Expert Syst. Appl., 40, 12, 4944-4956 (2013)
[78] D’Urso, P.; Disegna, M.; Massari, R.; Osti, L., Fuzzy segmentation of postmodern tourists, Tour. Manag., 55, 297-308 (2016)
[79] Szepannek, G., clustMixType: k-Prototypes Clustering for Mixed Variable-Type Data (2018), R package version 0.1-36
[80] Foss, A. H.; Markatou, M., kamila: clustering mixed-type data in R and Hadoop, J. Stat. Softw., 83, 1, 1-44 (2018)
[81] Weidenfeld, A.; Butler, R. W.; Williams, A. M., Clustering and compatibility between tourism attractions, Int. J. Tour. Res., 12, 1, 1-16 (2010)
[82] Weidenfeld, A.; Williams, A. M.; Butler, R. W., Knowledge transfer and innovation among attractions, Ann. Tour. Res., 37, 3, 604-626 (2010)
[83] Izakian, H.; Pedrycz, W.; Jamal, I., Fuzzy clustering of time series data using dynamic time warping distance, Eng. Appl. Artif. Intell., 39, 235-244 (2015)
[84] Tenenbaum, J. B.; De Silva, V.; Langford, J. C., A global geometric framework for nonlinear dimensionality reduction, Science, 290, 5500, 2319-2323 (2000)
[85] Hijmans, R. J., geosphere: Spherical Trigonometry (2017), R package version 1.5-7
[86] Boriah, S.; Chandola, V.; Kumar, V., Similarity measures for categorical data: A comparative evaluation, Proceedings of the 2008 SIAM International Conference on Data Mining, 243-254 (2008), SIAM
[87] Goodall, D. W., A new similarity index based on probability, Biometrics, 882-907 (1966)
[88] D’Urso, P.; Giordani, P., A robust fuzzy k-means clustering model for interval valued data, Comput. Stat., 21, 2, 251-269 (2006) · Zbl 1113.62076
[89] Pittau, M. G.; Massari, R.; Zelli, R., Hierarchical modelling of disparities in preferences for redistribution, Oxford Bull. Econ. Stat., 75, 4, 556-584 (2013)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.