
On strategies to fix degenerate \(k\)-means solutions. (English) Zbl 1373.62300

Summary: \(k\)-means is a benchmark algorithm in cluster analysis. It belongs to the large family of heuristics based on location-allocation steps, which alternately locate cluster centers and allocate data points to them until no further improvement is possible. Such heuristics are known to suffer from a phenomenon called degeneracy, in which some of the clusters are empty. In this paper, we propose and compare a series of strategies to circumvent degenerate solutions during a \(k\)-means execution. Our computational experiments show that these strategies are effective, leading to better clustering solutions in the vast majority of the cases in which degeneracy appears in \(k\)-means. Moreover, we compare the use of our fixing strategies within \(k\)-means against two initialization methods from the literature. These results demonstrate how useful the proposed strategies can be, especially inside memory-based clustering algorithms.
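The paper's specific fixing strategies are not reproduced in this summary. As a minimal sketch of the idea, the following Lloyd-style \(k\)-means loop adds one common repair (not necessarily one of the strategies studied in the paper): when a cluster comes up empty after the allocation step, its center is relocated onto the point farthest from its currently assigned center, before the location step recomputes the means. All names here are illustrative.

```python
import numpy as np

def kmeans_with_repair(X, k, iters=100, seed=0):
    """Lloyd-style k-means with a repair step for degenerate (empty) clusters.

    Illustrative sketch only: an empty cluster's center is moved onto the
    point farthest from its currently assigned center, restricted to points
    whose own cluster has more than one member so the repair cannot create
    a new empty cluster."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Allocation step: assign each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Repair step: fix any empty cluster before updating the centers.
        for j in range(k):
            if not np.any(labels == j):
                dist = ((X - centers[labels]) ** 2).sum(axis=1)
                sizes = np.bincount(labels, minlength=k)
                dist[sizes[labels] <= 1] = -1.0  # never strip a singleton cluster
                worst = int(dist.argmax())
                centers[j] = X[worst]
                labels[worst] = j
        # Location step: recompute each center as the mean of its points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

Other repairs are possible in the same slot, e.g. splitting the largest or highest-variance cluster in two; the point of the sketch is only where such a fix sits inside the location-allocation loop.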

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
