
A new class of weighted similarity indices using polytomous variables. (English) Zbl 1360.62343

Summary: We introduce new similarity measures between two subjects for variables with multiple categories. Unlike traditional similarity indices, they also take into account the frequency of each attribute’s categories in the sample. This is useful when dealing with rare categories, since the joint presence of a rare category in two subjects should be evaluated differently from the joint presence of a widespread one. A weighting criterion for each category, derived from Shannon’s information theory, is suggested. The weighted index comes in two versions: one for independent categorical variables and one for dependent variables. The suitability of the proposed indices is shown using both simulated and real-world data sets.
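
To make the idea in the summary concrete, here is a minimal Python sketch of an information-weighted matching index. It is not the index defined in the paper, whose exact formula is not given in this record. It assumes that a category with sample relative frequency p receives the Shannon self-information weight -log2(p), that variables are treated separately (roughly the independent case), and that the score is normalised by the largest weight each variable could contribute for the pair. The function names `category_weights` and `weighted_similarity` and the toy data are hypothetical.

```python
import math
from collections import Counter


def category_weights(column):
    """Weight each category of one variable by -log2 of its sample frequency."""
    n = len(column)
    return {cat: -math.log2(count / n) for cat, count in Counter(column).items()}


def weighted_similarity(data, i, j):
    """Information-weighted matching similarity between subjects i and j.

    Assumed normalisation (not taken from the paper): the matched weight is
    divided by the largest weight each variable could contribute for this pair.
    """
    matched = 0.0
    attainable = 0.0
    for column in zip(*data):  # one tuple of observed categories per variable
        weights = category_weights(column)
        wi, wj = weights[column[i]], weights[column[j]]
        if column[i] == column[j]:
            matched += wi
        attainable += max(wi, wj)
    return matched / attainable if attainable > 0 else 0.0


# Toy sample: 8 subjects, 2 categorical variables (rows are subjects).
data = [
    ("rare", "u"), ("rare", "v"),
    ("common", "u"), ("common", "v"),
    ("common", "u"), ("common", "v"),
    ("common", "u"), ("common", "v"),
]

# Both pairs agree on the first variable only, so simple matching gives 0.5 each.
print(round(weighted_similarity(data, 0, 1), 3))  # shared rare category   -> 0.667
print(round(weighted_similarity(data, 2, 3), 3))  # shared common category -> 0.293
```

Under these assumptions, agreement on the rare category (frequency 2/8, weight 2 bits) counts for more than agreement on the widespread one (frequency 6/8, weight about 0.415 bits), which is the behaviour the summary describes.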

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H25 Factor analysis and principal components; correspondence analysis
62P15 Applications of statistics to psychology

References:

[1] ALBATINEH, A.N., NIEWIADOMSKA-BUGAJ, M., and MIHALKO, D. (2006), ”On Similarity Indices and Correction for Chance Agreement”, Journal of Classification, 23, 301–313. · Zbl 1336.62168 · doi:10.1007/s00357-006-0017-z
[2] ANDERBERG, M.R. (1973), Cluster Analysis for Applications, New York: Academic Press. · Zbl 0299.62029
[3] ARABIE, P., HUBERT, L.J., and DE SOETE, G. (1996), Clustering and Classification, River Edge, NJ: World Scientific. · Zbl 0836.00014
[4] BAUER, D.J., and CURRAN, P.J. (2003), ”Distributional Assumptions of Growth Mixture Models: Implications for Overextraction of Latent Trajectory Classes”, Psychological Methods, 8, 338–363. · doi:10.1037/1082-989X.8.3.338
[5] BAULIEU, F.B. (1989), ”A Classification of Presence/Absence Based Dissimilarity Coefficients”, Journal of Classification, 6, 233–246. · Zbl 0691.62056 · doi:10.1007/BF01908601
[6] BORIAH, S., CHANDOLA, V., and KUMAR, V. (2008), ”Similarity Measures for Categorical Data: A Comparative Evaluation”, Proceedings of 2008 SIAM Data Mining Conference, Atlanta, GA.
[7] BRUSCO, M.J. (2004), ”Clustering Binary Data in the Presence of Masking Variables”, Psychological Methods, 9, 510–523. · doi:10.1037/1082-989X.9.4.510
[8] BURNABY, T.P. (1970), ”On a Method for Character Weighting a Similarity Coefficient, Employing the Concept of Information”, Mathematical Geology, 2, 25–38. · doi:10.1007/BF02332078
[9] BURNHAM, K.P., and ANDERSON, D.R. (2002), Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2nd ed.), New York: Springer Science. · Zbl 1005.62007
[10] CHATURVEDI, A.D., CARROLL, J.D., GREEN, P.E., and ROTONDO, J.A. (1997), ”A Feature Based Approach to Market Segmentation via Overlapping K-Centroids Clusters”, Journal of Marketing Research, 34, 370–377. · doi:10.2307/3151899
[11] CHATURVEDI, A.D., GREEN, P.E., and CARROLL, J.D. (2001), ”K-Modes Clustering”, Journal of Classification, 18, 35–55. · Zbl 1047.91566 · doi:10.1007/s00357-001-0004-3
[12] COVER, T.M., and THOMAS, J.A. (2006), Elements of Information Theory (2nd ed.), New York: Wiley-Interscience. · Zbl 1140.94001
[13] DORFMAN, J.H. (2007), Introduction to MATLAB Programming, with an Emphasis on Software Design through Numerical Examples, Berkeley, CA: Decagon Press.
[14] EVERITT, B.S., LANDAU, S., and LEESE, M. (2001), Cluster Analysis, New York: Oxford University Press. · Zbl 1205.62076
[15] GABARRO ARPA, J., and REVILLA, R. (2000), ”Clustering of a Molecular Dynamics Trajectory with a Hamming Distance”, Computers and Chemistry, 24, 693–698. · doi:10.1016/S0097-8485(00)00067-X
[16] GASIENIEC, L., JASSON, J., and LINGAS, A. (2004), ”Approximation Algorithms for Hamming Clustering Problems”, Journal of Discrete Algorithms, 2, 289–301. · Zbl 1118.68762 · doi:10.1016/S1570-8667(03)00079-0
[17] GIFI, A. (1990), Nonlinear Multivariate Analysis, Chichester: Wiley. · Zbl 0697.62048
[18] GNANADESIKAN, R., KETTENRING, J.R., and MALOOR, S. (2007), ”Better Alternatives to Current Methods of Scaling and Weighting Data for Cluster Analysis”, Journal of Statistical Planning and Inference, 173, 3483–3496. · Zbl 1119.62058 · doi:10.1016/j.jspi.2007.03.026
[19] GOLDMAN, S. (2005), Information Theory, New York: Prentice Hall. · Zbl 1154.94348
[20] GOODMAN, L.A., and KRUSKAL, W.H. (1954), ”Measures of Association for Cross Classification”, Journal of the American Statistical Association, 49, 732–765. · Zbl 0056.12801
[21] GORDON, A.D. (1999), Classification (2nd ed.), New York: Chapman & Hall, CRC. · Zbl 0929.62068
[22] GOWER, J.C. (1970), ”A Note on Burnaby’s Character-Weighted Similarity Coefficient”, Mathematical Geology, 2-1, 39–45. · doi:10.1007/BF02332079
[23] GOWER, J.C. (1971), ”A General Coefficient of Similarity and Some of its Properties”, Biometrics, 27, 857–871. · doi:10.2307/2528823
[24] GOWER, J.C., and LEGENDRE, P. (1986), ”Metric and Euclidean Properties of Dissimilarity Coefficients”, Journal of Classification, 3, 5–48. · Zbl 0592.62048 · doi:10.1007/BF01896809
[25] GREENACRE, M.J. (1984), Correspondence Analysis in Practice (2nd ed.), Florida: Chapman & Hall.
[26] GREENACRE, M.J. (2007), Theory and Applications of Correspondence Analysis, London: Academic Press. · Zbl 1198.62061
[27] HAMMING, R.W. (1950), ”Error Detecting and Error Correcting Codes”, Bell System Technical Journal, 29, 147–160. · doi:10.1002/j.1538-7305.1950.tb00463.x
[28] HEISER, W.J., and MEULMAN, J.J. (1997), ”Representation of Binary Multivariate Data by Graph Models Using the Hamming Metric”, in Computing Science and Statistics, 29-2, eds. E. Wegman and S. Azen, pp. 517–525.
[29] HELSEN, K., and GREEN, P.E. (1991), ”A Computational Study of Replicated Clustering with an Application to Marketing Research”, Decision Science, 22, 1124–1141. · doi:10.1111/j.1540-5915.1991.tb01910.x
[30] HUBERT, L., and ARABIE, P. (1985), ”Comparing Partitions”, Journal of Classification, 2, 193–218. · Zbl 0587.62128 · doi:10.1007/BF01908075
[31] JACCARD, P. (1901), ”Etude Comparative de la Distribution Florale Dans Une Portion des Alpes et des Jura”, Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 547–579.
[32] KURCZYNSKI, T.W. (1970), ”Generalized Distance and Discrete Variables”, Biometrics, 26-3, 525–534. · doi:10.2307/2529106
[33] LEBART, L., MORINEAU, A., and WARWICK, K. (1984), Multivariate Descriptive Statistical Analysis: Correspondence Analysis and Related Techniques for Large Matrices, New York: Wiley. · Zbl 0658.62069
[34] MACKAY, D.J.C. (2003), Information Theory, Inference and Learning Algorithms, Cambridge, UK: Cambridge University Press. · Zbl 1055.94001
[35] MILLIGAN, G.W., and COOPER, M.C. (1986), ”A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis”, Multivariate Behavioral Research, 21, 441–458. · doi:10.1207/s15327906mbr2104_5
[36] MOREY, L., and AGRESTI, A. (1984), ”The Measurement of Classification of Agreement: An Adjustment to the Rand Statistic for Chance Agreement”, Educational and Psychological Measurement, 44, 33–37. · doi:10.1177/0013164484441003
[37] RAND, W.M. (1971), ”Objective Criteria for the Evaluation of Clustering Methods”, Journal of the American Statistical Association, 66, 846–850. · doi:10.1080/01621459.1971.10482356
[38] REGISTER, A.H. (2007), A Guide to MATLAB Object-Oriented Programming, New York: Chapman & Hall, CRC. · Zbl 1116.68022
[39] SEPKOSKI, J.J. (1974), ”Quantified Coefficients of Association and Measurement of Similarity”, Mathematical Geology, 6, 135–152. · doi:10.1007/BF02080152
[40] SHANNON, C.E. (1948), ”A Mathematical Theory of Communication”, Bell System Technical Journal, 27, 379–423. · Zbl 1154.94303 · doi:10.1002/j.1538-7305.1948.tb01338.x
[41] SKRONDAL, A., and RABE-HESKETH, S. (2004), Generalized Latent Variable Modeling: Multilevel, Longitudinal and Structural Equation Models, Boca Raton FL: Chapman & Hall/CRC. · Zbl 1097.62001
[42] SNEATH, P.H., and SOKAL, R.R. (1973), Numerical Taxonomy, San Francisco CA: Freeman. · Zbl 0285.92001
[43] STEINLEY, D. (2004), ”Properties of the Hubert-Arabie Adjusted Rand Index”, Psychological Methods, 9, 386–396. · doi:10.1037/1082-989X.9.3.386
[44] STEINLEY, D. (2006), ”Profiling Local Optima in the K-Means Clustering: Developing a Diagnostic Technique”, Psychological Methods, 11, 178–192. · doi:10.1037/1082-989X.11.2.178
[45] STEINLEY, D., and BRUSCO, M.J. (2008), ”A New Variable Weighting and Selection Procedure for K-Means Cluster Analysis”, Multivariate Behavioral Research, 43, 77–108. · doi:10.1080/00273170701836695
[46] TENENHAUS, M., and YOUNG, F.W. (1985), ”An Analysis and Synthesis of Multiple Correspondence Analysis, Optimal Scaling, Dual Scaling, Homogeneity Analysis and Other Methods for Quantifying Categorical Multivariate Data”, Psychometrika, 50, 91–119. · Zbl 0585.62104 · doi:10.1007/BF02294151
[47] VANBELLE, S., and ALBERT, A. (2009), ”A Note on the Linearly Weighted Kappa Coefficient for Ordinal Scales”, Statistical Methodology, 6, 157–163. · Zbl 1220.62172 · doi:10.1016/j.stamet.2008.06.001
[48] WARRENS, M.J. (2008a), ”On the Indeterminacy of the Resemblance Measures for Binary (Presence/Absence) Data”, Journal of Classification, 25, 125–136. · Zbl 1260.62052 · doi:10.1007/s00357-008-9006-8
[49] WARRENS, M.J. (2008b), ”On the Equivalence of Cohen’s Kappa and the Hubert-Arabie Adjusted Rand Index”, Journal of Classification, 25, 177–183. · Zbl 1276.62043 · doi:10.1007/s00357-008-9023-7
[50] WARRENS, M.J. (2008c), ”Bounds of Resemblance Measures for Binary (Presence/Absence) Variables”, Journal of Classification, 25, 195–208. · Zbl 1276.62044 · doi:10.1007/s00357-008-9024-6
[51] WARRENS, M.J. (2008d), ”On Association Coefficients for \(2\times 2\) Tables and Properties That Do Not Depend on the Marginal Distributions”, Psychometrika, 73, 778–789. · Zbl 1284.62762
[52] WARRENS, M.J. (2010), ”Chance-Corrected Measures for \(2\times 2\) Tables that Coincide with Weighted Kappa”, British Journal of Mathematical and Statistical Psychology, 64, 355–365. · Zbl 1218.62060 · doi:10.1348/2044-8317.002001
[53] WARRENS, M.J. (2011), ”Inequalities Between Kappa and Kappa-Like Statistics for \(k\times k\) Tables”, Psychometrika, 75, 176–185. · Zbl 1272.62138 · doi:10.1007/s11336-009-9138-8
[54] ZANI, S. (1982), ”Sui Criteri di Ponderazione negli Indici di Similarità”, in Alcuni Lavori di Analisi Statistica Multivariata, ed. R. Leoni, Firenze, Italia, SIS, pp. 187–208.
[55] ZEGERS, F.E., and TEN BERGE, J.M.F. (1986), ”Correlation Coefficients for More than One Scale Type: An Alternative to the Janson and Vegelius Approach”, Psychometrika, 51, 549–557. · Zbl 0646.62099 · doi:10.1007/BF02295593
[56] ZHANG, P., WANG, X., and SONG, P.X. (2006), ”Clustering Categorical Data Based on Distance Vectors”, Journal of the American Statistical Association, 101, 355–367. · Zbl 1118.62341 · doi:10.1198/016214505000000312
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.