×

Faster balanced clusterings in high dimension. (English) Zbl 1455.68274

Summary: The problem of constrained clustering has attracted significant attention in the past decades. In this paper, we study the balanced \(k\)-center, \(k\)-median, and \(k\)-means clustering problems where the size of each cluster is constrained by the given lower and upper bounds. The problems are motivated by the applications in processing large-scale data in high dimension. Existing methods often need to compute complicated matchings (or min cost flows) to satisfy the balance constraint, and thus suffer from high complexities especially in high dimension. We develop an effective framework for the three balanced clustering problems to address this issue, and our method is based on a novel spatial partition idea in geometry. For the balanced \(k\)-center clustering, we provide a 4-approximation algorithm that improves the existing approximation factors; for the balanced \(k\)-median and \(k\)-means clusterings, our algorithms yield constant and \((1 + \epsilon)\)-approximation factors with any \(\epsilon > 0\). More importantly, our algorithms achieve linear or nearly linear running times when \(k\) is a constant, and significantly improve the existing ones. Our results can be easily extended to metric balanced clusterings and the running times are sub-linear in terms of the complexity of \(n\)-point metric.

MSC:

68W25 Approximation algorithms
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68U05 Computer graphics; computational geometry (digital and algorithmic aspects)
68W40 Analysis of algorithms
PDFBibTeX XMLCite
Full Text: DOI arXiv

References:

[1] Aggarwal, G.; Panigrahy, R.; Feder, T.; Thomas, D.; Kenthapadi, K.; Khuller, S.; Zhu, A., Achieving anonymity via clustering, ACM Trans. Algorithms, 6, 3, 49 (2010) · Zbl 1300.68023
[2] Ahmadian, S.; Swamy, C., Improved approximation guarantees for lower-bounded facility location, (Approximation and Online Algorithms - 10th International Workshop. Approximation and Online Algorithms - 10th International Workshop, WAOA 2012, Ljubljana, Slovenia, September 13-14, 2012 (2012)), 257-271, Revised Selected Papers · Zbl 1395.68332
[3] Ahmadian, S.; Swamy, C., Approximation algorithms for clustering problems with lower bounds and outliers, (43rd International Colloquium on Automata, Languages, and Programming. 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11-15, 2016, Rome, Italy (2016)), Article 69 pp. · Zbl 1395.90172
[4] An, H.-C.; Bhaskara, A.; Chekuri, C.; Gupta, S.; Madan, V.; Svensson, O., Centrality of trees for capacitated k k-center, Math. Program., 154, 1-2, 29-53 (2015) · Zbl 1337.90036
[5] Arora, S.; Raghavan, P.; Rao, S., Approximation schemes for Euclidean k-medians and related problems, (Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing (1998), ACM), 106-113 · Zbl 1027.68979
[6] Arya, V.; Garg, N.; Khandekar, R.; Meyerson, A.; Munagala, K.; Pandit, V., Local search heuristics for k-median and facility location problems, SIAM J. Comput., 33, 3, 544-562 (2004) · Zbl 1105.68118
[7] Awasthi, P.; Charikar, M.; Krishnaswamy, R.; Sinop, A. K., The hardness of approximation of Euclidean k-means, (31st International Symposium on Computational Geometry. 31st International Symposium on Computational Geometry, SoCG 2015, June 22-25, 2015, Eindhoven, The Netherlands (2015)), 754-767 · Zbl 1378.68048
[8] Aydin, K.; Bateni, M.; Mirrokni, V., Distributed balanced partitioning via linear embedding, (Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (2016), ACM), 387-396
[9] Badoiu, M.; Clarkson, K. L., Smaller core-sets for balls, (Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA) (2003)), 801-802 · Zbl 1092.68660
[10] Badoiu, M.; Har-Peled, S.; Indyk, P., Approximate clustering via core-sets, (Proceedings of the ACM Symposium on Theory of Computing (STOC) (2002)), 250-257 · Zbl 1192.68871
[11] Banerjee, A.; Ghosh, J., Scalable clustering algorithms with balancing constraints, Data Min. Knowl. Discov., 13, 3, 365-395 (2006)
[12] Barilan, J.; Kortsarz, G.; Peleg, D., How to allocate network centers, J. Algorithms, 15, 3, 385-415 (1993) · Zbl 0784.68012
[13] Bateni, M.; Bhaskara, A.; Lattanzi, S.; Mirrokni, V., Distributed balanced clustering via mapping coresets, (Advances in Neural Information Processing Systems (2014)), 2591-2599
[14] Bhattacharya, A.; Jaiswal, R.; Kumar, A., Faster algorithms for the constrained k-means problem, Theory Comput. Syst., 62, 1, 93-115 (2018) · Zbl 1387.68296
[15] Borgwardt, S.; Brieden, A.; Gritzmann, P., An lp-based k-means algorithm for balancing weighted point sets, Eur. J. Oper. Res., 263, 2, 349-355 (2017) · Zbl 1381.62154
[16] Byrka, J.; Pensyl, T.; Rybicki, B.; Srinivasan, A.; Trinh, K., An improved approximation for k-median and positive correlation in budgeted optimization, ACM Trans. Algorithms, 13, 2, 23 (2017) · Zbl 1454.90069
[17] Charikar, M.; Guha, S., Improved combinatorial algorithms for facility location problems, SIAM J. Comput., 34, 4, 803-824 (2005) · Zbl 1075.68100
[18] Chen, K., On coresets for k-median and k-means clustering in metric and Euclidean spaces and their applications, SIAM J. Comput., 39, 3, 923-947 (2009) · Zbl 1192.68880
[19] Cohen-Addad, V.; Klein, P. N.; Mathieu, C., Local search yields approximation schemes for k-means and k-median in Euclidean and minor-free metrics, (Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on (2016), IEEE), 353-364
[20] Cygan, M.; Hajiaghayi, M.; Khuller, S., Lp rounding for k-centers with non-uniform hard capacities, (Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on (2012), IEEE), 273-282
[21] Dasgupta, S., The hardness of k-means clustering (2008), Technical Report
[22] Dick, T.; Li, M.; Pillutla, V. K.; White, C.; Balcan, N.; Smola, A., Data driven resource allocation for distributed learning, (Artificial Intelligence and Statistics (2017)), 662-671
[23] Ding, H., Balanced k-center clustering when k is A constant, (Proceedings of the 29th Canadian Conference on Computational Geometry. Proceedings of the 29th Canadian Conference on Computational Geometry, CCCG 2017, July 26-28, 2017 (2017), Carleton University: Carleton University Ottawa, Ontario, Canada), 179-184
[24] Ding, H.; Hu, L.; Huang, L.; Li, J., Capacitated center problems with two-sided bounds and outliers, (Workshop on Algorithms and Data Structures (2017), Springer), 325-336 · Zbl 1454.68179
[25] Ding, H.; Xu, J., A unified framework for clustering constrained data without locality property, (Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms. Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015 (2015)), 1471-1490 · Zbl 1371.68291
[26] Edelsbrunner, H.; Valtr, P.; Welzl, E., Cutting dense point sets in half, Discrete Comput. Geom., 17, 3, 243-255 (1997) · Zbl 0870.68153
[27] Ene, A.; Har-Peled, S.; Raichel, B., Fast clustering with lower bounds: no customer too far, no shop too small (2013), arXiv preprint
[28] J. Erickson, Course lecture: Extensions of maximum flow.
[29] Esfandiari, H.; Mirrokni, V. S.; Zhong, P., Streaming balanced clustering (2019), CoRR
[30] Feldman, D.; Langberg, M., A unified framework for approximating and clustering data, (Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing (2011), ACM), 569-578 · Zbl 1288.90046
[31] Friggstad, Z.; Rezapour, M.; Salavatipour, M. R., Local search yields a PTAS for k-means in doubling metrics, (Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on (2016), IEEE), 365-374
[32] Gonzalez, T. F., Clustering to minimize the maximum intercluster distance, Theor. Comput. Sci., 38, 293-306 (1985) · Zbl 0567.62048
[33] Guruswami, V.; Indyk, P., Embeddings and non-approximability of geometric problems, (Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2003), Society for Industrial and Applied Mathematics), 537-538 · Zbl 1092.68690
[34] Hochbaum, D. S.; Shmoys, D. B., A best possible heuristic for the k-center problem, Math. Oper. Res., 10, 2, 180-184 (1985) · Zbl 0565.90015
[35] Indyk, P., Sublinear time algorithms for metric space problems, (Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing (1999), ACM), 428-434 · Zbl 1346.68256
[36] Indyk, P.; Motwani, R.; Venkatasubramanian, S., Geometric matching under noise: combinatorial bounds and algorithms, (Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms (1999), Society for Industrial and Applied Mathematics), 457-465 · Zbl 0934.68108
[37] Jain, A. K., Data clustering: 50 years beyond k-means, Pattern Recognit. Lett., 31, 8, 651-666 (2010)
[38] Jain, K.; Vazirani, V. V., Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation, J. ACM, 48, 2, 274-296 (2001) · Zbl 1138.90417
[39] Jaiswal, R.; Kumar, A.; Sen, S., A simple d 2-sampling based PTAS for k-means and other clustering problems, Algorithmica, 70, 1, 22-46 (2014) · Zbl 1364.68369
[40] Jaiswal, R.; Kumar, M.; Yadav, P., Improved analysis of d2-sampling based PTAS for k-means and other clustering problems, Inf. Process. Lett., 115, 2, 100-103 (2015) · Zbl 1302.68341
[41] Jaiswal, R.; Sen, S., Approximate clustering, (Handbook of Approximation Algorithms and Metaheuristics (2018), Chapman and Hall/CRC), 169-186
[42] Kanungo, T.; Mount, D. M.; Netanyahu, N. S.; Piatko, C. D.; Silverman, R.; Wu, A. Y., A local search approximation algorithm for k-means clustering, Comput. Geom., 28, 2-3, 89-112 (2004) · Zbl 1077.68109
[43] Khuller, S.; Sussmann, Y. J., The capacitated k-center problem, SIAM J. Discrete Math., 13, 3, 403-418 (2000) · Zbl 0947.05073
[44] Kociumaka, T.; Cygan, M., Constant factor approximation for capacitated k-center with outliers (2014), arXiv preprint · Zbl 1359.90056
[45] Kolliopoulos, S. G.; Rao, S., A nearly linear-time approximation scheme for the Euclidean k-median problem, SIAM J. Comput., 37, 3, 757-782 (2007) · Zbl 1144.68052
[46] Kuehn, A. A.; Hamburger, M. J., A heuristic program for locating warehouses, Manag. Sci., 9, 4, 643-666 (1963)
[47] Kumar, A.; Sabharwal, Y.; Sen, S., Linear-time approximation schemes for clustering problems in any dimensions, J. ACM, 57, 2, 5 (2010) · Zbl 1327.68334
[48] Li, S., On uniform capacitated k-median beyond the natural LP relaxation, ACM Trans. Algorithms, 13, 2, 22 (2017) · Zbl 1451.90089
[49] Li, S.; Svensson, O., Approximating k-median via pseudo-approximation, SIAM J. Comput., 45, 2, 530-547 (2016) · Zbl 1338.90346
[50] Lin, W.; He, Z.; Xiao, M., Balanced clustering: a uniform model and fast algorithm, (Kraus, S., Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019 (2019)), 2987-2993
[51] Mahajan, M.; Nimbhorkar, P.; Varadarajan, K., The planar k-means problem is NP-hard, Theor. Comput. Sci., 442, 13-21 (2012) · Zbl 1260.68158
[52] Malinen, M. I.; Fränti, P., Balanced k-means for clustering, (Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR) (2014), Springer), 32-41
[53] Manne, A. S., Plant location under economies-of-scale—decentralization and computation, Manag. Sci., 11, 2, 213-235 (1964)
[54] Megiddo, N.; Supowit, K. J., On the complexity of some common geometric location problems, SIAM J. Comput., 13, 1, 182-196 (1984) · Zbl 0534.68032
[55] Orlin, J. B., Max flows in o (nm) time, or better, (Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing (2013), ACM), 765-774 · Zbl 1293.05151
[56] Rösner, C.; Schmidt, M., Privacy preserving clustering with constraints, (45th International Colloquium on Automata, Languages, and Programming. 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, July 9-13, 2018, Prague, Czech Republic (2018)), Article 96 pp. · Zbl 1499.90117
[57] Vapnik, V.; Bottou, L., Local algorithms for pattern recognition and dependencies estimation, Neural Comput., 5, 6, 893-909 (1993)
[58] A. Vattani, The hardness of k-means clustering in the plane. · Zbl 1380.68204
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.