## A local search approximation algorithm for $$k$$-means clustering.(English)Zbl 1077.68109

Summary: In $$k$$-means clustering we are given a set of $$n$$ data points in $$d$$-dimensional space $$\operatorname{Re}^d$$ and an integer $$k$$, and the problem is to determine a set of $$k$$ points in $$\operatorname{Re}^d$$, called centers, to minimize the mean squared distance from each data point to its nearest center. No exact polynomial-time algorithms are known for this problem. Although asymptotically efficient approximation algorithms exist, these algorithms are not practical due to the very high constant factors involved. There are many heuristics that are used in practice, but we know of no bounds on their performance.
We consider the question of whether there exists a simple and practical approximation algorithm for $$k$$-means clustering. We present a local improvement heuristic based on swapping centers in and out. We prove that this yields a ($$9+\varepsilon$$)-approximation algorithm. We present an example showing that any approach based on performing a fixed number of swaps achieves an approximation factor of at least ($$9-\varepsilon$$) in all sufficiently high dimensions. Thus, our approximation factor is almost tight for algorithms based on performing a fixed number of swaps. To establish the practical value of the heuristic, we present an empirical study that shows that, when combined with Lloyd’s algorithm, this heuristic performs quite well in practice.

### MSC:

 68U05 Computer graphics; computational geometry (digital and algorithmic aspects) 68W25 Approximation algorithms

clusfind
Full Text:

### References:

 [1] Agarwal, P.K.; Procopiuc, C.M., Exact and approximation algorithms for clustering, (), 658-667 · Zbl 0930.68168 [2] Alsabti, K.; Ranka, S.; Singh, V., An efficient k-means clustering algorithm, () [3] Arora, S.; Raghavan, P.; Rao, S., Approximation schemes for euclidean k-Median and related problems, (), 106-113 · Zbl 1027.68979 [4] Arya, S.; Mount, D.M.; Narayan, O., Accounting for boundary effects in nearest-neighbor searching, Discrete comput. geom., 16, 155-176, (1996) · Zbl 0853.68081 [5] Arya, V.; Garg, N.; Khandekar, R.; Pandit, V.; Meyerson, A.; Munagala, K., Local search heuristics for k-Median and facility location problems, (), 21-29 · Zbl 1323.90031 [6] Ball, G.H.; Hall, D.J., Some fundamental concepts and synthesis procedures for pattern recognition preprocessors, () · Zbl 0193.15303 [7] Bandyopadhyay, S.; Maulik, U.; Pakhira, M.K., Clustering using simulated annealing with probabilistic redistribution, Internat. J. patt. recog. artif. intell., 15, 269-285, (2001) [8] Capoyleas, V.; Rote, G.; Woeginger, G., Geometric clusterings, J. algorithms, 12, 341-356, (1991) · Zbl 0734.68092 [9] Charikar, M.; Guha, S., Improved combinatorial algorithms for the facility location and k-medians problem, (), 378-388 [10] Du, Q.; Faber, V.; Gunzburger, M., Centroidal Voronoi tesselations: applications and algorithms, SIAM rev., 41, 637-676, (1999) · Zbl 0983.65021 [11] Duda, R.O.; Hart, P.E., Pattern classification and scene analysis, (1973), Wiley New York · Zbl 0277.68056 [12] ElGamal, A.E.; Hemanchandra, L.A.; Shperling, I.; Wei, V.K., Using simulated annealing to design good codes, IEEE trans. inform. theory, 33, 116-123, (1987) [13] Eppstein, D., Faster construction of planar two-centers, () · Zbl 1321.68500 [14] Faber, V., Clustering and the continuous k-means algorithm, Los alamos sci., 22, 138-144, (1994) [15] Fayyad, U.M.; Piatetsky-Shapiro, G.; Smyth, P.; Uthurusamy, R., Advances in knowledge discovery and data mining, (1996), AAAI/MIT Press [16] Feller, W., An introduction to probability theory and its applications, (1968), Wiley New York · Zbl 0155.23101 [17] Forgey, E., Cluster analysis of multivariate data: efficiency vs. interpretability of classification, Biometrics, 21, 768, (1965) [18] Garey, M.R.; Johnson, D.S., Computers and intractability: A guide to the theory of NP-completeness, (1979), Freeman New York · Zbl 0411.68039 [19] Gersho, A.; Gray, R.M., Vector quantization and signal compression, (1992), Kluwer Academic Boston, MA · Zbl 0782.94001 [20] Inaba, M.; Katoh, N.; Imai, H., Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering, (), 332-339 [21] Jain, A.K.; Dubes, R.C., Algorithms for clustering data, (1988), Prentice Hall Englewood Cliffs, NJ · Zbl 0665.62061 [22] Jain, A.K.; Duin, P.W.; Mao, J., Statistical pattern recognition: A review, IEEE trans. patt. anal. Mach. intell., 22, 1, 4-37, (2000) [23] Jain, A.K.; Murty, M.N.; Flynn, P.J., Data clustering: A review, ACM comput. surv., 31, 3, 264-323, (1999) [24] Kanungo, T.; Mount, D.M.; Netanyahu, N.S.; Piatko, C.; Silverman, R.; Wu, A.Y., An efficient k-means clustering algorithm: analysis and implementation, IEEE trans. patt. anal. Mach. intell., 24, (2002) [25] Kaufman, L.; Rousseeuw, P.J., Finding groups in data: an introduction to cluster analysis, (1990), Wiley New York · Zbl 1345.62009 [26] Kirkpatrick, S.; Gelatt, C.D.; Vecchi, M.P., Optimization by simulated annealing, Science, 220, 671-680, (1983) · Zbl 1225.90162 [27] Kohonen, T., Self-organization and associative memory, (1989), Springer-Verlag New York · Zbl 0528.68062 [28] Kolliopoulos, S.; Rao, S., A nearly linear-time approximation scheme for the euclidean k-Median problem, (), 362-371 [29] Korupolu, M.; Plaxton, C.; Rajaraman, R., Analysis of a local search heuristic for facility location problems, (), 1-10 · Zbl 0922.90100 [30] Lloyd, S.P., Least squares quantization in PCM, IEEE trans. inform. theory, 28, 129-137, (1982) · Zbl 0504.94015 [31] MacQueen, J., Some methods for classification and analysis of multivariate observations, (), 281-296 · Zbl 0214.46201 [32] Matoušek, J., On approximate geometric k-clustering, Discrete comput. geom., 24, 61-84, (2000) · Zbl 0959.68126 [33] Mettu, R.R.; Plaxton, C.G., Optimal time bounds for approximate clustering, (), 339-348 · Zbl 1069.68091 [34] Ng, R.T.; Han, J., Efficient and effective clustering methods for spatial data mining, (), 144-155 [35] Pelleg, D.; Moore, A., Accelerating exact k-means algorithms with geometric reasoning, (), 277-281 [36] Pelleg, D.; Moore, A., x-means: extending k-means with efficient estimation of the number of clusters, () [37] Phillips, S.J., Acceleration of k-means and related clustering problems, () · Zbl 1014.68581 [38] Selim, S.Z.; Ismail, M.A., K-means-type algorithms: A generalized convergence theorem and characterization of local optimality, IEEE trans. patt. anal. Mach. intell., 6, 81-87, (1984) · Zbl 0546.62037 [39] Sharir, M., A near-linear algorithm for the planar 2-center problem, Discrete comput. geom., 18, 125-134, (1997) · Zbl 0878.68131 [40] Thorup, M., Quick k-Median, k-center, and facility location for sparse graphs, (), 249-260 · Zbl 0986.90067 [41] Vaisey, J.; Gersho, A., Simulated annealing and codebook design, (), 1176-1179 [42] Wesolowsky, G., The Weber problem: history and perspective, Location sci., 1, 5-23, (1993) · Zbl 0923.90110
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.