
Decomposition/aggregation \(k\)-means for big data. (English) Zbl 1460.90153

Kochetov, Yury (ed.) et al., Mathematical optimization theory and operations research. 19th international conference, MOTOR 2020, Novosibirsk, Russia, July 6–10, 2020. Revised selected papers. Cham: Springer. Commun. Comput. Inf. Sci. 1275, 409-420 (2020).
Summary: The well-known and widely applied \(k\)-means clustering heuristic is used for solving the minimum sum-of-squares clustering (MSSC) problem. For large instances this technique has two major drawbacks: (i) it must process the entire input dataset, which incurs heavy computational costs, and (ii) it tends to converge to local minima of poor quality. To reduce the computational complexity, we propose a clustering technique that works on subsets of the entire dataset in a stream-like fashion. Using different heuristics, the algorithm transforms the Big Data into Small Data, clusters it, and uses the obtained centroids to initialize \(k\)-means on the original Big Data. This is especially important for Big Data, since a better initialization yields faster convergence. The approach also allows effective parallelization. The proposed technique dynamically evaluates cluster parameters from sequential data portions (windows) by aggregating the corresponding criteria estimates. Within a fixed clustering time, our approach progresses through a number of partial solutions and aggregates them into a better one, in contrast to the single solution obtainable by regular \(k\)-means-type clustering on the whole dataset within the same time limit. Promising results are reported on instances from the literature and on synthetically generated data with several millions of entities.
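For reference, the MSSC objective that \(k\)-means heuristically minimizes is the standard one: given entities \(x_1,\dots,x_n\in\mathbb{R}^d\), find centroids \(c_1,\dots,c_k\in\mathbb{R}^d\) that minimize
\[\sum_{i=1}^{n}\min_{1\le j\le k}\lVert x_i-c_j\rVert^2.\]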
For the entire collection see [Zbl 1454.90004].
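A minimal Python sketch of the decomposition/aggregation scheme described in the summary, assuming scikit-learn's KMeans and an illustrative window size; this is an outline of the idea, not the authors' implementation:

import numpy as np
from sklearn.cluster import KMeans

def decomposition_aggregation_kmeans(X, k, window_size=10000, seed=0):
    """Cluster windows of X, aggregate their centroids, warm-start on full X."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    # Decomposition: cluster each data window independently (parallelizable).
    partial = []
    for start in range(0, len(X), window_size):
        window = X[order[start:start + window_size]]
        if len(window) < k:  # skip a short tail window
            continue
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(window)
        partial.append(km.cluster_centers_)
    # Aggregation: the pooled partial centroids form "Small Data"; clustering
    # them yields k aggregated centroids summarizing all partial solutions.
    small = np.vstack(partial)
    meta = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(small)
    # Final pass: initialize k-means on the original "Big Data" with the
    # aggregated centroids; a better initialization gives faster convergence.
    final = KMeans(n_clusters=k, init=meta.cluster_centers_, n_init=1).fit(X)
    return final.cluster_centers_, final.labels_, final.inertia_

Since each window is clustered independently, the decomposition step parallelizes naturally, while the aggregation step only touches the small pooled-centroid set.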

MSC:

90C27 Combinatorial optimization
90C59 Approximation methods and heuristics in mathematical programming

Software:

k-means++
