×

Comparing clusterings – an information based distance. (English) Zbl 1298.91124

Summary: This paper proposes an information theoretic criterion for comparing two partitions, or clusterings, of the same data set. The criterion, called variation of information (VI), measures the amount of information lost and gained in changing from clustering \(\mathcal C\) to clustering \(\mathcal C'\). The basic properties of VI are presented and discussed. We focus on two kinds of properties: (1) those that help one build intuition about the new criterion (in particular, it is shown the VI is a true metric on the space of clusterings), and (2) those that pertain to the comparability of VI values over different experimental conditions. As the latter properties have rarely been discussed explicitly before, other existing comparison criteria are also examined in their light. Finally we present the VI from an axiomatic point of view, showing that it is the only “sensible” criterion for comparing partitions that is both aligned to the lattice and convexely additive. As a consequence, we prove an impossibility result for comparing partitions: there is no criterion for comparing partitions that simultaneously satisfies the above two desirable properties and is bounded.

MSC:

91C20 Clustering in the social and behavioral sciences
94A17 Measures of information, entropy
05A18 Partitions of sets
PDF BibTeX XML Cite
Full Text: DOI

References:

[1] Anderberg, M.R., Cluster analysis for applications, (1973), Academic Press New York, NY · Zbl 0299.62029
[2] Ben-Hur, A.; Elisseeff, A.; Guyon, I., A stability based method for discovering structure in clustered data, (), 6-17
[3] Cover, T.M.; Thomas, J.A., Elements of information theory, (1991), Wiley New York · Zbl 0762.94001
[4] Fowlkes, E.B.; Mallows, C.L., A method for comparing two hierarchical clusterings, J. amer. statist. assoc., 78, 383, 553-569, (1983) · Zbl 0545.62042
[5] Hartigan, J.A., Clustering algorithms, (1975), Wiley New York, NY · Zbl 0321.62069
[6] Hubert, L.; Arabie, P., Comparing partitions, J. classification, 2, 193-218, (1985)
[7] Jain, A.K.; Dubes, R.C., Algorithms for clustering data, (1988), Prentice-Hall Englewood Cliffs, NJ · Zbl 0665.62061
[8] Larsen, B.; Aone, C., Fast and effective text mining using linear time document clustering, (), 16-22
[9] Lloyd, S.P., Least squares quantization in PCM, IEEE trans. inform. theory, 28, 129-137, (1982) · Zbl 0504.94015
[10] MacLachlan, G.J.; Bashford, K.E., Mixture models: inference and applications to clustering, (1988), Marcel Dekker NY
[11] Meilă, M., Comparing clusterings—an axiomatic view, ()
[12] Meilă, M.; Heckerman, D., An experimental comparison of modelbased clustering methods, Mach. learning, 42, 1/2, 9-29, (2001)
[13] Mirkin, B.G., Mathematical classification and clustering, (1996), Kluwer Academic Press Dordrecht · Zbl 1008.91101
[14] Mirkin, B.G.; Cherny, L.B., Measurement of the distance between distinct partitions of a finite set of objects, Automat. remote control, 31, 5, 786, (1970) · Zbl 0221.05029
[15] Rand, W.M., Objective criteria for the evaluation of clustering methods, J. amer. statist. assoc., 66, 846-850, (1971)
[16] Rényi, A., Probability theory, (1970), North-Holland Amsterdam · Zbl 0206.18002
[17] Stanley, R.P., Enumerative combinatorics, (1997), Cambridge University Press Cambridge · Zbl 0889.05001
[18] Steinley, D.L., Properties of the hubert – arabie adjusted rand index, Psychological methods, 9, 3, 386-396, (2004), (simulations of some adjusted indices and of misclassification error)
[19] S. van Dongen, Performance criteria for graph clustering and Markov cluster experiments, Technical Report INS-R0012, Centrum voor Wiskunde en Informatica, 2000.
[20] Wallace, D.L., Comment, J. amer. statist. assoc., 78, 383, 569-576, (1983)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.