An exact algorithm for semi-supervised minimum sum-of-squares clustering. (English) Zbl 1520.62090

Summary: The minimum sum-of-squares clustering (MSSC), or \(k\)-means type clustering, is traditionally considered an unsupervised learning task. In recent years, the use of background knowledge to improve cluster quality and promote interpretability of the clustering process has become a hot research topic at the intersection of mathematical optimization and machine learning. The problem of exploiting background information in data clustering is called semi-supervised or constrained clustering. In this paper, we present a branch-and-cut algorithm for semi-supervised MSSC, where background knowledge is incorporated as pairwise must-link and cannot-link constraints. For the lower bound procedure, we solve the semidefinite programming relaxation of the MSSC discrete optimization model and strengthen the bound with a cutting-plane procedure. For the upper bound, instead, we use integer programming tools to adapt the \(k\)-means algorithm to the constrained case. For the first time, the proposed global optimization algorithm efficiently solves real-world instances of up to 800 data points with different combinations of must-link and cannot-link constraints and with a generic number of features. This problem size is about four times larger than that of the instances solved by state-of-the-art exact algorithms.
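To illustrate the kind of constrained \(k\)-means heuristic the summary refers to for the upper bound, the following is a minimal COP-\(k\)-means-style sketch in the spirit of Wagstaff et al. [65]: each point is assigned to its nearest feasible centroid, skipping clusters that would violate a must-link or cannot-link constraint. All names and details are illustrative; this is not the authors' integer-programming-based implementation.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+

def violates(idx, cluster, labels, must_link, cannot_link):
    """Would assigning point idx to `cluster` break a pairwise constraint?"""
    for a, b in must_link:
        other = b if a == idx else a if b == idx else None
        # must-link broken if the linked point already sits in another cluster
        if other is not None and labels[other] not in (None, cluster):
            return True
    for a, b in cannot_link:
        other = b if a == idx else a if b == idx else None
        # cannot-link broken if the linked point already sits in this cluster
        if other is not None and labels[other] == cluster:
            return True
    return False

def cop_kmeans(points, k, must_link, cannot_link, n_iter=50, seed=0):
    """Lloyd-style iterations with a feasibility check at assignment time."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(n_iter):
        labels = [None] * len(points)
        for idx in rng.sample(range(len(points)), len(points)):
            # try clusters from nearest to farthest, skipping infeasible ones
            for c in sorted(range(k), key=lambda c: dist(points[idx], centers[c])):
                if not violates(idx, c, labels, must_link, cannot_link):
                    labels[idx] = c
                    break
            if labels[idx] is None:
                raise RuntimeError("no feasible assignment for point %d" % idx)
        for c in range(k):  # recompute centroids over assigned members
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels
```

Note that, unlike the exact algorithm of the paper, this greedy assignment order can fail to find a feasible partition even when one exists (hence the `RuntimeError` branch), which is precisely the gap that global optimization approaches close.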

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence
90C22 Semidefinite programming
90C57 Polyhedral combinatorics, branch-and-bound, branch-and-cut

References:

[1] Alizadeh, F., Interior point methods in semidefinite programming with applications to combinatorial optimization, SIAM J. Optim., 5, 1, 13-51 (1995) · Zbl 0833.90087
[2] Aloise, D.; Deshpande, A.; Hansen, P.; Popat, P., NP-hardness of Euclidean sum-of-squares clustering, Mach. Learn., 75, 245-248 (2009) · Zbl 1378.68047
[3] Aloise, D.; Hansen, P., A branch-and-cut SDP-based algorithm for minimum sum-of-squares clustering, Pesqui. Operacional, 29, 3, 503-516 (2009)
[4] Aloise, D.; Hansen, P.; Liberti, L., An improved column generation algorithm for minimum sum-of-squares clustering, Math. Program., 131, 1, 195-220 (2012) · Zbl 1236.90095
[5] Aloise, D.; Hansen, P.; Rocha, C., A column generation algorithm for semi-supervised minimum sum-of-squares clustering, (Global Optimization Workshop 2012 (2012)), 19-22
[6] Babaki, B.; Guns, T.; Nijssen, S., Constrained clustering using column generation, (International Conference on AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems (2014), Springer), 438-454 · Zbl 1407.68387
[7] Basu, S.; Banerjee, A.; Mooney, R., Active semi-supervision for pairwise constrained clustering, (Proceedings of the SIAM International Conference on Data Mining (2004))
[8] Basu, S.; Davidson, I.; Wagstaff, K., Constrained Clustering: Advances in Algorithms, Theory, and Applications (2008), CRC Press
[9] Baumann, P., A binary linear programming-based k-means algorithm for clustering with must-link and cannot-link constraints, (2020 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM) (2020), IEEE), 324-328
[10] Bilenko, M., Basu, S., Mooney, R.J., 2004. Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the Twenty-First International Conference on Machine Learning. p. 11.
[11] Borgwardt, S.; Brieden, A.; Gritzmann, P., Geometric clustering for the consolidation of farmland and woodland, Math. Intelligencer, 36, 2, 37-44 (2014) · Zbl 1401.52025
[12] Brieden, A.; Gritzmann, P.; Klemm, F., Constrained clustering via diagrams: A unified theory and its application to electoral district design, European J. Oper. Res., 263, 1, 18-34 (2017) · Zbl 1380.91072
[13] Brusco, M. J., A repetitive branch-and-bound procedure for minimum within-cluster sums of squares partitioning, Psychometrika, 71, 2, 347-363 (2006) · Zbl 1306.62387
[14] Celebi, M. E.; Kingravi, H. A.; Vela, P. A., A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert Syst. Appl., 40, 1, 200-210 (2013)
[15] Dau, H. A.; Keogh, E.; Kamgar, K.; Yeh, C.-C. M.; Zhu, Y.; Gharghabi, S.; Ratanamahatana, C. A.; Yanping; Hu, B.; Begum, N.; Bagnall, A.; Mueen, A.; Batista, G., The UCR time series classification archive (2018), https://www.cs.ucr.edu/~eamonn/time_series_data_2018/
[16] Davidson, I.; Ravi, S., Clustering with constraints: Feasibility issues and the k-means algorithm, (Proceedings of the 2005 SIAM International Conference on Data Mining (2005), SIAM), 138-149
[17] Davidson, I., Ravi, S., 2007. Intractability and clustering with constraints. In: Proceedings of the 24th International Conference on Machine Learning. pp. 201-208.
[18] Diehr, G., Evaluation of a branch and bound algorithm for clustering, SIAM J. Sci. Stat. Comput., 6, 2, 268-284 (1985) · Zbl 0561.65097
[19] Dinler, D.; Tural, M. K., A survey of constrained clustering, (Unsupervised Learning Algorithms (2016), Springer), 207-235
[20] Du Merle, O.; Hansen, P.; Jaumard, B.; Mladenovic, N., An interior point algorithm for minimum sum-of-squares clustering, SIAM J. Sci. Comput., 21, 4, 1485-1505 (1999) · Zbl 1049.90129
[21] Dua, D.; Graff, C., UCI machine learning repository (2017), URL: http://archive.ics.uci.edu/ml
[22] Duong, K.-C.; Vrain, C., A declarative framework for constrained clustering, (Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2013), Springer), 419-434
[23] Duong, K.-C.; Vrain, C., Constrained minimum sum of squares clustering by constraint programming, (International Conference on Principles and Practice of Constraint Programming (2015), Springer), 557-573
[24] Duong, K.-C.; Vrain, C., Constrained clustering by constraint programming, Artificial Intelligence, 244, 70-94 (2017) · Zbl 1404.68141
[25] Eckart, C.; Young, G., The approximation of one matrix by another of lower rank, Psychometrika, 1, 3, 211-218 (1936) · JFM 62.1075.02
[26] Filippone, M.; Camastra, F.; Masulli, F.; Rovetta, S., A survey of kernel and spectral methods for clustering, Pattern Recognit., 41, 1, 176-190 (2008) · Zbl 1122.68530
[27] Gambella, C.; Ghaddar, B.; Naoum-Sawaya, J., Optimization problems for machine learning: A survey, European J. Oper. Res., 290, 3, 807-828 (2021) · Zbl 1487.90004
[28] Gançarski, P.; Dao, T.-B.-H.; Crémilleux, B.; Forestier, G.; Lampert, T., Constrained clustering: Current and new trends, (A Guided Tour of Artificial Intelligence Research (2020), Springer), 447-484
[29] Ganji, M.; Bailey, J.; Stuckey, P. J., Lagrangian constrained clustering, (Proceedings of the 2016 SIAM International Conference on Data Mining (2016), SIAM), 288-296
[30] Gnägi, M.; Baumann, P., A matheuristic for large-scale capacitated clustering, Comput. Oper. Res., 132, Article 105304 pp. (2021) · Zbl 1510.90155
[31] González-Almagro, G.; Luengo, J.; Cano, J.-R.; García, S., DILS: constrained clustering through dual iterative local search, Comput. Oper. Res., 121, Article 104979 pp. (2020) · Zbl 1458.68202
[32] Guns, T.; Dao, T.-B.-H.; Vrain, C.; Duong, K.-C., Repetitive branch-and-bound using constraint programming for constrained minimum sum-of-squares clustering, (Proceedings of the Twenty-Second European Conference on Artificial Intelligence. Proceedings of the Twenty-Second European Conference on Artificial Intelligence, ECAI’16 (2016), IOS Press: IOS Press NLD), 462-470
[33] Gurobi Optimization, LLC, Gurobi optimizer reference manual (2021), URL: http://www.gurobi.com
[34] Hartigan, J. A.; Wong, M. A., Algorithm AS 136: A k-means clustering algorithm, J. R. Stat. Soc. Ser. C. Appl. Stat., 28, 1, 100-108 (1979) · Zbl 0447.62062
[35] Hu, G.; Zhou, S.; Guan, J.; Hu, X., Towards effective document clustering: A constrained K-means based approach, Inf. Process. Manage., 44, 4, 1397-1409 (2008)
[36] Huang, H.; Cheng, Y.; Zhao, R., A semi-supervised clustering algorithm based on must-link set, (International Conference on Advanced Data Mining and Applications (2008), Springer), 492-499
[37] Huang, Y., Mitchell, T.M., 2006. Text clustering with extended user feedback. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 413-420.
[38] Hubert, L.; Arabie, P., Comparing partitions, J. Classification, 2, 1, 193-218 (1985)
[39] Jain, A. K.; Murty, M. N.; Flynn, P. J., Data clustering: a review, ACM Comput. Surv., 31, 3, 264-323 (1999)
[40] Jansson, C.; Chaykin, D.; Keil, C., Rigorous error bounds for the optimal value in semidefinite programming, SIAM J. Numer. Anal., 46, 1, 180-200 (2008) · Zbl 1167.90009
[41] Koontz, W. L.G.; Narendra, P. M.; Fukunaga, K., A branch and bound clustering algorithm, IEEE Trans. Comput., 100, 9, 908-915 (1975) · Zbl 0308.68039
[42] Krislock, N.; Malick, J.; Roupin, F., Computational results of a semidefinite branch-and-bound algorithm for k-cluster, Comput. Oper. Res., 66, 153-159 (2016) · Zbl 1349.90715
[43] Lai, X.; Hao, J.-K.; Fu, Z.-H.; Yue, D., Neighborhood decomposition-driven variable neighborhood search for capacitated clustering, Comput. Oper. Res., Article 105362 pp. (2021) · Zbl 1511.90405
[44] Li, X.; Yin, H.; Zhou, K.; Zhou, X., Semi-supervised clustering with deep metric learning and graph embedding, World Wide Web, 23 (2020)
[45] Liberti, L.; Manca, B., Side-constrained minimum sum-of-squares clustering: mathematical programming and random projections, J. Global Optim., 1-36 (2021)
[46] Lloyd, S., Least squares quantization in PCM, IEEE Trans. Inform. Theory, 28, 2, 129-137 (1982) · Zbl 0504.94015
[47] MacQueen, J., et al., 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Oakland, CA, USA, Vol. 14. pp. 281-297. · Zbl 0214.46201
[48] Maraziotis, I. A., A semi-supervised fuzzy clustering algorithm applied to gene expression data, Pattern Recognit., 45, 1, 637-648 (2012) · Zbl 1225.68200
[49] Pacheco, J. A., A scatter search approach for the minimum sum-of-squares clustering problem, Comput. Oper. Res., 32, 5, 1325-1335 (2005) · Zbl 1075.90037
[50] Pena, J. M.; Lozano, J. A.; Larranaga, P., An empirical comparison of four initialization methods for the k-means algorithm, Pattern Recognit. Lett., 20, 10, 1027-1040 (1999)
[51] Peng, J.; Wei, Y., Approximating k-means-type clustering via semidefinite programming, SIAM J. Optim., 18, 1, 186-205 (2007) · Zbl 1146.90046
[52] Peng, J.; Xia, Y., A cutting algorithm for the minimum sum-of-squared error clustering, (Proceedings of the 2005 SIAM International Conference on Data Mining (2005), SIAM), 150-160
[53] Pensa, R. G.; Boulicaut, J.-F., Constrained co-clustering of gene expression data, (Proceedings of the 2008 SIAM International Conference on Data Mining (2008), SIAM), 25-36
[54] Piccialli, V.; Sudoso, A. M.; Wiegele, A., SOS-SDP: an exact solver for minimum sum-of-squares clustering, INFORMS J. Comput. (2022) · Zbl 07587562
[55] Rao, M., Cluster analysis and mathematical programming, J. Amer. Statist. Assoc., 66, 335, 622-626 (1971) · Zbl 0238.90042
[56] Rutayisire, T.; Yang, Y.; Lin, C.; Zhang, J., A modified cop-kmeans algorithm based on sequenced cannot-link set, (Yao, J.; Ramanna, S.; Wang, G.; Suraj, Z., Rough Sets and Knowledge Technology (2011), Springer Berlin Heidelberg: Springer Berlin Heidelberg Berlin, Heidelberg), 217-225
[57] Sherali, H. D.; Desai, J., A global optimization RLT-based approach for solving the hard clustering problem, J. Global Optim., 32, 2, 281-306 (2005) · Zbl 1123.62045
[58] Sun, D.; Toh, K.-C.; Yang, L., A convergent 3-block semiproximal alternating direction method of multipliers for conic programming with 4-type constraints, SIAM J. Optim., 25, 2, 882-915 (2015) · Zbl 1328.90083
[59] Sun, D.; Toh, K.-C.; Yuan, Y.; Zhao, X.-Y., SDPNAL+: A matlab software for semidefinite programming with bound constraints (version 1.0), Optim. Methods Softw., 35, 1, 87-115 (2020) · Zbl 1432.90104
[60] Tan, W.; Yang, Y.; Li, T., An improved COP-kmeans algorithm for solving constraint violation, (Computational Intelligence: Foundations and Applications (2010), World Scientific), 690-696
[61] Tran, D. H.; Babaki, B.; Van Daele, D.; Leyman, P.; De Causmaecker, P., Local search for constrained graph clustering in biological networks, Comput. Oper. Res., 132, Article 105299 pp. (2021) · Zbl 1510.90281
[62] Vassilvitskii, S., Arthur, D., 2006. k-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. pp. 1027-1035. · Zbl 1302.68273
[63] Vinh, N. X.; Epps, J.; Bailey, J., Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance, J. Mach. Learn. Res., 11, 2837-2854 (2010) · Zbl 1242.62062
[64] Vrain, C.; Davidson, I., Constrained clustering via post-processing, (International Conference on Discovery Science (2020), Springer), 53-67
[65] Wagstaff, K.; Cardie, C.; Rogers, S.; Schroedl, S., Constrained k-means clustering with background knowledge, (ICML, Vol. 1 (2001)), 577-584
[66] Xia, Y., A global optimization method for semi-supervised clustering, Data Min. Knowl. Discov., 18, 2, 214-256 (2009)
[67] Xiang, S.; Nie, F.; Zhang, C., Learning a Mahalanobis distance metric for data clustering and classification, Pattern Recognit., 41, 12, 3600-3612 (2008) · Zbl 1162.68642
[68] Yang, L.; Sun, D.; Toh, K.-C., SDPNAL+: a majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints, Math. Program. Comput., 7, 3, 331-366 (2015) · Zbl 1321.90085
[69] Zhang, H.; Basu, S.; Davidson, I., A framework for deep constrained clustering-algorithms and advances, (Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2019), Springer), 57-72
[70] Zhu, X.; Goldberg, A. B., Introduction to semi-supervised learning, Synth. Lect. Artif. Intell. Mach. Learn., 3, 1, 1-130 (2009) · Zbl 1209.68435