×

Anytime parallel density-based clustering. (English) Zbl 1416.62351

Summary: The density-based clustering algorithm DBSCAN is a state-of-the-art data clustering technique with numerous applications in many fields. However, DBSCAN requires neighborhood queries for all objects and propagation of labels from object to object. This scheme is time consuming and thus limits its applicability for large datasets. In this paper, we propose a novel anytime approach to cope with this problem by reducing both the range query and the label propagation time of DBSCAN. Our algorithm, called AnyDBC, compresses the data into smaller density-connected subsets called primitive clusters and labels objects based on connected components of these primitive clusters to reduce the label propagation time. Moreover, instead of passively performing range queries for all objects as in existing techniques, AnyDBC iteratively and actively learns the current cluster structure of the data and selects a few most promising objects for refining clusters at each iteration. Thus, in the end, it performs substantially fewer range queries compared to DBSCAN while still satisfying the cluster definition of DBSCAN. Moreover, by processing queries in block and merging the results into the current cluster structure, AnyDBC can be efficiently parallelized on shared memory architectures to further accelerate the performance, uniquely making it a parallel and anytime technique at the same time. Experiments show speedup factors of orders of magnitude compared to DBSCAN and its fastest variants as well as a high parallel scalability on multicore processors for very large real and synthetic complex datasets.

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence
PDFBibTeX XMLCite
Full Text: DOI

References:

[1] Aggarwal CC, Reddy CK (eds) (2014) Data clustering: algorithms and applications. CRC Press, Boca Raton · Zbl 1331.62026
[2] Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: International conference on management of data (SIGMOD), pp 49-60
[3] Arlia D, Coppola M (2001) Experiments in parallel clustering with DBSCAN. In: International Euro-par conference, pp 326-331 · Zbl 1005.68633
[4] Assent I, Kranen P, Baldauf C, Seidl T (2012) AnyOut: anytime outlier detection on streaming data. In: International conference on database systems for advanced applications (DASFAA) (1), pp 228-242
[5] Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509-517 · Zbl 0306.68061 · doi:10.1145/361002.361007
[6] Böhm C, Noll R, Plant C, Wackersreuther B (2009) Density-based clustering using graphics processors. In: International conference on information and knowledge management (CIKM), pp 661-670
[7] Böhm C, Feng J, He X, Mai ST, Plant C, Shao J (2011) A novel similarity measure for fiber clustering using longest common subsequence. In: Proceedings of the 2011 workshop on data mining for medicine and healthcare, pp 1-9
[8] Borah B, Bhattacharyya DK (2004) An improved sampling-based DBSCAN for large spatial databases. In: International conference on intelligent sensing and information processing (ICISIP), pp 92-96
[9] Brecheisen S, Kriegel H, Pfeifle M (2004) Efficient density-based clustering of complex objects. In: IEEE international conference on data mining (ICDM), pp 43-50
[10] Brecheisen S, Kriegel H, Pfeifle M (2006a) Parallel density-based clustering of complex objects. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 179-188
[11] Brecheisen S, Kriegel HP, Pfeifle M (2006b) Multi-step density-based clustering. Knowl. Inf Syst 9(3):284-308
[12] Chen L, Ng RT (2004) On the marriage of Lp-norms and edit distance. In: Very large data bases (VLDB), pp 792-803
[13] Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, Cambridge · Zbl 1187.68679
[14] Dai BR, Lin IC (2012) Efficient map/reduce-based DBSCAN algorithm with optimized data partition. In: IEEE international conference on cloud computing (CLOUD), pp 59-66
[15] Dash M, Liu H, Xu X (2001) ‘\[1 + 1 > 21+1>2\]’: Merging distance and density based clustering. In: International conference on database systems for advanced applications (DASFAA), pp 32-39
[16] Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 226-231
[17] Fonollosa J, Sheik S, Huerta R, Marco S (2015) Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sensors Actuators B: Chem 215:618-629 · doi:10.1016/j.snb.2015.03.028
[18] Francis Z, Villagrasa C, Clairand I (2011) Simulation of DNA damage clustering after proton irradiation using an adapted DBSCAN algorithm. Comput Methods Programs Biomed 101(3):265-270 · doi:10.1016/j.cmpb.2010.12.012
[19] Gan J, Tao Y (2015) DBSCAN revisited: mis-claim, un-fixability, and approximation. In: International conference on management of data (SIGMOD), pp 519-530
[20] Gan J, Tao Y (2017) On the hardness and approximation of Euclidean DBSCAN. ACM Trans Database Syst 42(3):14:1-14:45 · Zbl 1474.68420 · doi:10.1145/3083897
[21] Götz M, Bodenstein C, Riedel M (2015) HPDBSCAN: highly parallel DBSCAN. In: Proceedings of the workshop on machine learning in high-performance computing environments, pp 2:1-2:10
[22] Greiner J (1994) A comparison of parallel algorithms for connected components. In: Proceedings of the 6th annual ACM symposium on parallel algorithms and architectures (SSPA), pp 16-25
[23] Gunawan A (2013) A faster algorithm for DBSCAN. Msc thesis, TU Eindhoven
[24] He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J (2011) MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: International conference on parallel and distributed systems (ICPADS), pp 473-480
[25] Januzaj E, Kriegel HP, Pfeifle M (2004) Scalable density-based distributed clustering. In: European conference on principles of data mining and knowledge discovery (PKDD), pp 231-244
[26] Kobayashi T, Iwamura M, Matsuda T, Kise K (2013) An anytime algorithm for camera-based character recognition. In: International conference on document analysis and recognition (ICDAR), pp 1140-1144
[27] Kriegel HP, Pfeifle M (2005) Density-based clustering of uncertain data. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 672-677
[28] Kriegel H, Schubert E, Zimek A (2017) The (black) art of runtime evaluation: are we comparing algorithms or implementations? Knowl Inf Syst 52(2):341-378 · doi:10.1007/s10115-016-1004-2
[29] Kumar V (2002) Introduction to parallel computing, 2nd edn. Addison-Wesley, Boston
[30] Li T, Heinis T, Luk W (2016) Hashing-based approximate DBSCAN. In: Symposium on advances in databases and information systems (ADBIS), pp 31-45
[31] Li T, Heinis T, Luk W (2017) ADvaNCE—efficient and scalable approximate density-based clustering based on hashing. Informatica 28(1):105-130 · doi:10.15388/Informatica.2017.122
[32] Lulli A, Dell’Amico M, Michiardi P, Ricci L (2016) NG-DBSCAN: scalable density-based clustering for arbitrary data. Proc VLDB Endow (PVLDB) 10(3):157-168 · doi:10.14778/3021924.3021932
[33] Mahran S, Mahar K (2008) Using grid for accelerating density-based clustering. In: IEEE international conference on computer and information technology (CIT), pp 35-40
[34] Mai ST, Goebl S, Plant C (2012) A similarity model and segmentation algorithm for white matter fiber tracts. In: IEEE international conference on data mining (ICDM), pp 1014-1019
[35] Mai ST, He X, Feng J, Böhm C (2013a) Efficient anytime density-based clustering. In: SIAM international conference on data mining (SDM), pp 112-120
[36] Mai ST, He X, Hubig N, Plant C, Böhm C (2013b) Active density-based clustering. In: IEEE international conference on data mining (ICDM), pp 508-517
[37] Mai ST, He X, Feng J, Plant C, Böhm C (2015) Anytime density-based clustering of complex data. Knowl Inf Syst 45(2):319-355 · doi:10.1007/s10115-014-0797-0
[38] Mai ST, Assent I, Le A (2016a) Anytime OPTICS: an efficient approach for hierarchical density-based clustering. In: International conference on database systems for advanced applications (DASFAA), pp 164-179
[39] Mai ST, Assent I, Storgaard M (2016b) AnyDBC: an efficient anytime density-based clustering algorithm for very large complex datasets. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 1025-1034
[40] Mai ST, Dieu MS, Assent I, Jacobsen J, Kristensen J, Birk M (2017) Scalable and interactive graph clustering algorithm on multicore CPUs. In: IEEE international conference on data engineering (ICDE), pp 349-360
[41] Patwary MMA, Palsetia D, Agrawal A, Liao W, Manne F, Choudhary AN (2012) A new scalable parallel DBSCAN algorithm using the disjoint-set data structure. In: Proceedings of the international conference on high performance computing, networking, storage and analysis (SC), p 62
[42] Patwary MMA, Satish N, Sundaram N, Manne F, Habib S, Dubey P (2014) Pardicle: parallel approximate density-based clustering. In: Proceedings of the international conference on high performance computing, networking, storage and analysis (SC), pp 560-571
[43] Reiss A, Stricker D (2012) Introducing a new benchmarked dataset for activity monitoring. In: International symposium on wearable computers (ISWC), pp 108-109
[44] Sakai T, Tamura K, Kitakami H (2017) Cell-based DBSCAN algorithm using minimum bounding rectangle criteria. In: International conference on database systems for advanced applications (DASFAA), pp 133-144
[45] Sander J, Ester M, Kriegel HP, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169-194 · doi:10.1023/A:1009745219419
[46] Schubert E, Sander J, Ester M, Kriegel H, Xu X (2017) DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst 42(3):19:1-19:21 · doi:10.1145/3068335
[47] Tramacere A, Vecchio \[C (2013) \gamma\] γ-Ray DBSCAN: a clustering algorithm applied to fermi-LAT \[\gamma\] γ-ray data—I. Detection performances with real and simulated data. Astron Astrophys 549:A138 · doi:10.1051/0004-6361/201220133
[48] Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: International conference on machine learning (ICML), pp 1073-1080
[49] Wang X, Hamilton HJ (2003) DBRS: a density-based spatial clustering method with random sampling. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 563-575 · Zbl 1032.68636
[50] Xu X, Jäger J, Kriegel HP (1999) A fast parallel clustering algorithm for large spatial databases. Data Min Knowl Discov 3(3):263-290 · doi:10.1023/A:1009884809343
[51] Zaki MJ, M W Jr (2014) Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press, New York · doi:10.1017/CBO9780511810114
[52] Zhao W, Hopke PK, Prather KA (2008) Comparison of two cluster analysis methods using single particle mass spectra. Atmos Environ 42(5):881-892 · doi:10.1016/j.atmosenv.2007.10.024
[53] Zhou S, Zhou A, Cao J, Jin W, Fan Y, Hu Y (2000) Combining sampling technique with DBSCAN algorithm for clustering large spatial databases. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 169-172
[54] Zilberstein S (1996) Using anytime algorithms in intelligent systems. AI Mag 17(3):73-83
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.