Correcting Jaccard and other similarity indices for chance agreement in cluster analysis. (English) Zbl 1274.62414

Summary: Correcting a similarity index for chance agreement requires computing its expectation under fixed marginal totals of a matching counts matrix. For some indices, such as Jaccard, Rogers and Tanimoto, Sokal and Sneath, and Gower and Legendre, the expectations cannot be easily found. We show how such similarity indices can be expressed as functions of other indices, and their expectations found by approximation, so that approximate correction is possible. A second approach is based on Taylor series expansion. A simulation study illustrates the effectiveness of the resulting correction of similarity indices using structured and unstructured data generated from bivariate normal distributions.
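
As a rough illustration of what correction for chance agreement means, the sketch below computes the Jaccard index J = a / (a + b + c) from the pair counts of a matching counts matrix and rescales it as (J - E[J]) / (1 - E[J]), the usual form when the index has maximum 1. It is not the method of the paper: the expectation is estimated by Monte Carlo permutation of one labelling (which keeps both sets of marginal totals fixed) rather than by the approximations or the Taylor series expansion described in the summary. The function names and the parameters `n_perm` and `seed` are hypothetical.

```python
import random
from collections import Counter
from math import comb


def pair_counts(u, v):
    """Return (a, b, c): object pairs placed together in both
    clusterings, only in u, and only in v."""
    cells = Counter(zip(u, v))  # cells of the matching counts matrix
    same_both = sum(comb(m, 2) for m in cells.values())
    same_u = sum(comb(m, 2) for m in Counter(u).values())
    same_v = sum(comb(m, 2) for m in Counter(v).values())
    return same_both, same_u - same_both, same_v - same_both


def jaccard(u, v):
    """Jaccard index on pair counts: a / (a + b + c)."""
    a, b, c = pair_counts(u, v)
    # No co-clustered pairs at all: both partitions are all singletons.
    return a / (a + b + c) if (a + b + c) else 1.0


def corrected_jaccard(u, v, n_perm=2000, seed=0):
    """Chance-corrected Jaccard with E[J] estimated by permutation.

    Assumption (not from the paper): shuffling v preserves the row and
    column totals of the matching counts matrix, so the mean of J over
    shuffles estimates its expectation under fixed marginals.
    """
    rng = random.Random(seed)
    observed = jaccard(u, v)
    shuffled = list(v)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        total += jaccard(u, shuffled)
    expected = total / n_perm
    if expected == 1.0:
        return float("nan")  # correction undefined when E[J] = 1
    return (observed - expected) / (1.0 - expected)


u = [0, 0, 1, 1]
v = [0, 0, 0, 1]
print(jaccard(u, v), corrected_jaccard(u, v))
```

In this small example every permutation of `v` yields the same Jaccard value as the observed one, so the uncorrected index of 0.25 reflects chance agreement alone and the corrected value is 0.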


62H30 Classification and discrimination; cluster analysis (statistical aspects)




[1] Albatineh AN, Niewiadomska-Bugaj M, Mihalko DP (2006) On similarity indices and correction for chance agreement. J Classif 23: 301–313 · Zbl 1336.62168
[2] Albatineh AN, Niewiadomska-Bugaj M (2011) MCS: a method for finding the number of clusters. J Classif 28. doi: 10.1007/s00357-010-9069-1 · Zbl 1271.62130
[3] Albatineh AN (2010) Means and variances for a family of similarity indices used in cluster analysis. J Stat Plan Inference 140: 2828–2838 · Zbl 1191.62111
[4] Czekanowski J (1932) “Coefficient of racial likeness” und “durchschnittliche Differenz”. Anthropologischer Anzeiger 14: 227–249
[5] Dice LR (1945) Measures of the amount of ecological association between species. Ecology 26: 297–302
[6] Fligner MA, Verducci JS, Blower PE (2002) A modification of the Jaccard–Tanimoto similarity index for diverse selection of chemical compounds using binary strings. Technometrics 44: 110–119
[7] Fowlkes EB, Mallows CL (1983) A method for comparing two hierarchical clusterings. J Am Stat Assoc 78: 553–569 · Zbl 0545.62042
[8] Gower JC, Legendre P (1986) Metric and Euclidean properties of dissimilarity coefficients. J Classif 3: 5–48 · Zbl 0592.62048
[9] Hamann U (1961) Merkmalsbestand und Verwandtschaftsbeziehungen der Farinosae. Willdenowia 2: 639–768
[10] Hubálek Z (1982) Coefficients of association and similarity based on binary (presence–absence) data: an evaluation. Biol Rev 57: 669–689
[11] Hubert L, Arabie P (1985) Comparing partitions. J Classif 2: 193–218 · Zbl 0587.62128
[12] Jaccard P (1908) Nouvelles recherches sur la distribution florale. Bull Soc Vaudoise Sci Nat 44: 223–270
[13] Jaccard P (1912) The distribution of the flora of the alpine zone. New Phytol 11: 37–50
[14] Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, New Jersey · Zbl 0665.62061
[15] Janson S, Vegelius J (1981) Measures of ecological association. Oecologia 49: 371–376
[16] Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32: 241–254 · Zbl 1367.62191
[17] Kulczyński S (1927) Die Pflanzenassoziationen der Pieninen. Bulletin International de l’Académie Polonaise des Sciences et des Lettres, Classe des Sciences Mathématiques et Naturelles. Series B, Supplément II 2: 57–203
[18] Lamont BB, Grant KJ (1979) A comparison of twenty-one measures of site dissimilarity. In: Orlóci L, Rao CR, Stiteler WM (eds) Multivariate methods in ecological work. International Cooperation Publishing House, Fairland, pp 101–126
[19] Lancaster HO (1969) The Chi-squared distribution. John Wiley, New York · Zbl 0193.17802
[20] Lehmann EL (1959) Testing statistical hypotheses. Wiley, New York · Zbl 0089.14102
[21] Legendre P, Legendre L (1998) Numerical ecology. Elsevier, Amsterdam · Zbl 1033.92036
[22] McConnaughey BH (1964) The determination and analysis of plankton communities. Marine Research, Special No, Indonesia, pp 1–40
[23] Milligan G, Cooper M (1986) A study of the comparability of external criteria for hierarchical cluster analysis. Multivar Behav Res 21: 441–458
[24] Milligan G, Soon S, Sokol L (1983) The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Trans Patt Anal Mach Intell PAMI-5: 40–47
[25] Morey L, Agresti A (1984) The measurement of classification agreement: an adjustment to the Rand statistic for chance agreement. Educ Psychol Meas 44: 33–37
[26] Rand W (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66: 846–850
[27] Rogers DJ, Tanimoto TT (1960) A computer program for classifying plants. Science 132: 1115–1118
[28] Russell PF, Rao TR (1940) On habitat and association of species of anopheline larvae in South-Eastern Madras. J Malar Inst India 3: 153–178
[29] Saxena PC, Navaneetham K (1991) The effect of cluster size, dimensionality, and number of clusters on recovery of true cluster structure through Chernoff-type faces. Statistician 40: 415–425
[30] Saxena PC, Navaneetham K (1993) Comparison of Chernoff-type face and non-graphical methods for clustering multivariate observations. Comput Stat Data Anal 15: 63–79 · Zbl 0937.62527
[31] Snijders TAB, Dormaar M, Van Schuur WH, Dijkman-Caes C, Driessen G (1990) Distribution of some similarity coefficients for dyadic binary data in the case of associated attributes. J Classif 7: 5–31 · Zbl 0711.62054
[32] Sokal RR, Michener CD (1958) A statistical method for evaluating systematic relationships. Univ Kansas Sci Bull 38: 1409–1438
[33] Sokal RR, Sneath PHA (1963) Principles of numerical taxonomy. WH Freeman, San Francisco
[34] Sørensen T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species content. Biologiske Skrifter 5: 1–34
[35] Southwood TS (1978) Ecological methods. Chapman and Hall, London
[36] Steinley D (2004) Properties of the Hubert–Arabie adjusted Rand index. Psychol Methods 9: 386–396
[37] Van Der Maarel E (1969) On the use of ordination models in phytosociology. Vegetatio 19: 21–46
[38] Wallace DL (1983) A method for comparing two hierarchical clusterings: comment. J Am Stat Assoc 78: 569–576