Hierarchical constraints. (English) Zbl 1319.68167

Summary: Constrained clustering received a lot of attention in the last years. However, the widely used pairwise constraints are not generally applicable for hierarchical clustering, where the goal is to derive a cluster hierarchy instead of a flat partition. Therefore, we propose for the hierarchical setting – based on the ideas of pairwise constraints – the use of must-link-before (MLB) constraints. In this paper, we discuss their properties and present an algorithm that is able to create a hierarchy by considering these constraints directly. Furthermore, we propose an efficient data structure for its implementation and evaluate its effectiveness with different datasets in a text clustering scenario.


68T05 Learning and adaptive systems in artificial intelligence
62H30 Classification and discrimination; cluster analysis (statistical aspects)


Full Text: DOI


[1] Amigó, E.; Gonzalo, J.; Artiles, J.; Verdejo, F., A comparison of extrinsic clustering evaluation metrics based on formal constraints, Information Retrieval, 12, 461-486, (2009)
[2] Bade, K.; Benz, D., Evaluation strategies for learning algorithms of hierarchies, 83-92, (2010)
[3] Bade, K.; Nürnberger, A., Personalized hierarchical clustering, 181-187, (2006), Washington
[4] Bade, K.; Nürnberger, A., Creating a cluster hierarchy under constraints of a partially known hierarchy, 13-24, (2008)
[5] Bade, K.; Nürnberger, A., Learning a metric during hierarchical clustering based on constraints, (2009)
[6] Bade, K.; Hermkes, M.; Nürnberger, A.; Kok, J. N. (ed.); Koronacki, J. (ed.); Mántaras, R. L. (ed.); Matwin, S. (ed.); Mladenic, D. (ed.); Skowron, A. (ed.), User oriented hierarchical information organization and retrieval, No. 4701, 518-526, (2007), Berlin
[7] Bae, E.; Bailey, J., Coala: a novel approach for the extraction of an alternate clustering of high quality and high dissimilarity, 53-62, (2006), Washington
[8] Bagga, A.; Baldwin, B., Entity-based cross-document coreferencing using the vector space model, 79-85, (1998)
[9] Bar-Hillel, A.; Hertz, T.; Shental, N.; Weinshall, D., Learning distance functions using equivalence relations, 11-18, (2003)
[10] Bar-Hillel, A.; Hertz, T.; Shental, N.; Weinshall, D., Learning a Mahalanobis metric from equivalence constraints, Journal of Machine Learning Research, 6, 937-965, (2005) · Zbl 1222.68140
[11] Basu, S.; Banerjee, A.; Mooney, R. J., Semi-supervised clustering by seeding, 27-34, (2002)
[12] Basu, S.; Banerjee, A.; Mooney, R., Active semi-supervision for pairwise constrained clustering, 333-344, (2004)
[13] Basu, S.; Bilenko, M.; Mooney, R. J., A probabilistic framework for semi-supervised clustering, 59-68, (2004)
[14] Basu, S., Davidson, I., & Wagstaff, K. L. (Eds.) (2008). Constrained clustering: advances in algorithms, theory, and applications. London/Boca Raton: Chapman & Hall/CRC. · Zbl 1142.68005
[15] Bilenko, M.; Basu, S.; Mooney, R. J., Integrating constraints and metric learning in semi-supervised clustering, 81-88, (2004)
[16] Borgelt, C. (2005). Prototype-base classification and clustering. Habilitation, Otto-von-Guericke-University Magdeburg.
[17] Borgelt, C.; Nürnberger, A., Fast fuzzy clustering of web page collections, Pisa, Italy
[18] Brank, J.; Mladenic, D.; Groblenik, M., Gold standard based ontology evaluation using instance assignment, (2006)
[19] Cathey, R. J.; Jensen, E. C.; Beitzel, S. M.; Frieder, O.; Grossman, D., Exploiting parallelism to support scalable hierarchical clustering, Journal of the American Society for Information Science and Technology, 58, 1207-1221, (2007)
[20] Choi, B.; Peng, X., Dynamic and hierarchical classification of web pages, Online Information Review, 28, 139-147, (2004)
[21] Cohn, D., Caruana, R., & McCallum, A. (2003). Semi-supervised clustering with user feedback (Technical Report TR2003-1892). Cornell University. · Zbl 1161.68759
[22] Davidson, I.; Ravi, S. S., Agglomerative hierarchical clustering with constraints: theoretical and empirical results, 59-70, (2005)
[23] Davidson, I.; Ravi, S. S., Clustering with constraints: feasibility issues and the \(k\)-means algorithm, 138-149, (2005)
[24] Davidson, I.; Ravi, S. S., Identifying and generating easy sets of constraints for clustering, 336-341, (2006)
[25] Davidson, I.; Ravi, S. S., The complexity of non-hierarchical clustering with instance and cluster level constraints, Data Mining and Knowledge Discovery, 14, 25-61, (2007)
[26] Davidson, I.; Ravi, S. S., Intractability and clustering with constraints, 201-208, (2007)
[27] Davidson, I.; Ravi, S. S., Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results, Data Mining and Knowledge Discovery, 18, 257-282, (2009)
[28] Davidson, I.; Wagstaff, K.; Basu, S., Measuring constraint-set utility for partitional clustering algorithms, 115-126, (2006)
[29] Day, W. H. E.; Edelsbrunner, H., Efficient algorithms for agglomerative hierarchical clustering methods, Journal of Classification, 1, 7-24, (1984) · Zbl 0563.62034
[30] Finley, T.; Joachims, T., Supervised clustering with support vector machines, 217-224, (2005)
[31] Fisher, D., Knowledge acquisition via incremental conceptual clustering, Machine Learning, 2, 139-172, (1987)
[32] Gonzalez, R. C., & Woods, R. E. (2007). Digital image processing. New York: Prentice-Hall.
[33] Grira, N.; Crucianu, M.; Boujemaa, N., Fuzzy clustering with pairwise constraints for knowledge-driven image categorization, 299-304, (2004)
[34] Halkidi, M.; Batistakis, Y.; Vazirgiannis, M., Cluster validity methods: part I, ACM SIGMOD Record, 31, 40-45, (2002)
[35] Halkidi, M.; Batistakis, Y.; Vazirgiannis, M., Clustering validity checking methods: part II, ACM SIGMOD Record, 31, 19-27, (2002)
[36] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction (2nd ed.). Berlin: Springer. http://www-stat.stanford.edu/ tibs/ElemStatLearn/. · Zbl 1273.62005
[37] Jones, W. (2008). Keeping found things found. San Mateo: Morgan Kaufmann.
[38] Jones, W., & Teevan, J. (Eds.) (2007). Personal information management. Seatle: University of Washington Press.
[39] Kestler, H. A.; Kraus, J. M.; Palm, G.; Schwenker, F.; Schwenker, F. (ed.); Marinai, S. (ed.), On the effects of constraints in semi-supervised hierarchical clustering, No. 4087, 57-66, (2006)
[40] Khosla, R., Westfall, D. G., Reich, R. M., Mahal, J. S., & Gangloff, W. J. (2010). Spatial variation and site-specific management zones (1st ed., pp. 195-219). Berlin: Springer.
[41] Kim, H.; Lee, S., An effective document clustering method using user-adaptable distance metrics, 16-20, (2002)
[42] Klein, D.; Kamvar, S.; Manning, C., From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering, 307-314, (2002)
[43] Manning, C. D., & Schütze, H. (1999). Foundations of natural language processing. Cambridge: MIT Press. · Zbl 0951.68158
[44] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press. · Zbl 1160.68008
[45] McKusick, K. B.; Langley, P., Constraints on tree structure in concept formation, 810-816, (1991) · Zbl 0751.68056
[46] Rand, W. M., Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66, 622-626, (1971)
[47] van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworths. · Zbl 0227.68052
[48] Ruiz, C.; Menasalvas, E.; Spiliopoulou, M., Constraint-based query clustering, 304-309, (2007)
[49] Ruiz, C.; Spiliopoulou, M.; Menasalvas, E., C-dbscan: density-based clustering with constraints, 216-223, (2007)
[50] Ruß, G.; Kruse, R., Exploratory hierarchical clustering for management zone delineation in precision agriculture, No. 6870, 161-173, (2011), Berlin
[51] Salton, G.; Buckley, C., Term-weighting approaches in automatic text retrieval, Information Processing & Management, 25, 513-523, (1988)
[52] Schultz, M.; Joachims, T., Learning a distance metric from relative comparisons, (2004)
[53] Sebastiani, F., Machine learning in automated text categorization, ACM Computing Surveys, 34, 1-47, (2002)
[54] Sinka, M.; Corne, D., A large benchmark dataset for web document clustering, No. 87, 881-890, (2002)
[55] Wagstaff, K. (2002). Intelligent clustering with instance-level constraints. PhD thesis, Cornell University.
[56] Wagstaff, K.; Cardie, C., Clustering with instance-level constraints, 1103-1110, (2000)
[57] Wagstaff, K.; Cardie, C.; Rogers, S.; Schroedl, S., Constrained \(k\)-means clustering with background knowledge, 577-584, (2001)
[58] Xing, E.; Ng, A.; Jordan, M.; Russell, S., Distance metric learning with application to clustering with side-information, Advances in Neural Information Processing Systems, 15, 505-512, (2003)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.