Equi-Clustream: a framework for clustering time evolving mixed data. (English) Zbl 1416.62358

Summary: In a data stream environment, most conventional clustering algorithms are not sufficiently efficient, since large volumes of data arrive continuously and the data points unfold over time. The problems of clustering time-evolving metric data and time-evolving categorical data have each been well explored in recent years, but clustering mixed-type time-evolving data remains challenging because of the gap between the structure of metric attributes and that of categorical attributes. In this paper, we devise a generalized framework, termed Equi-Clustream, to dynamically cluster mixed-type time-evolving data. It comprises three algorithms: a Hybrid Drifting Concept Detection Algorithm that detects concept drift between the current sliding window and the previous sliding window; a Hybrid Data Labeling Algorithm that assigns an appropriate cluster label to each data vector of the current non-drifting window, based on the clustering result of the previous sliding window; and a visualization algorithm that analyses the relationship between the clusters at different timestamps and visualizes the evolving trends of the clusters. The efficacy of the proposed framework is shown by experiments on synthetic and real-world datasets.
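
The window-by-window flow described in the summary can be illustrated with a minimal Python sketch. This is not the paper's Hybrid Drifting Concept Detection or Hybrid Data Labeling algorithm, whose details the summary does not give. It is a simplified stand-in, assuming a k-prototypes-style dissimilarity (squared Euclidean distance on numeric attributes plus a weighted count of categorical mismatches) and an outlier-fraction rule for drift. All function names and the parameters `gamma`, `radius`, and `outlier_threshold` are hypothetical.

```python
from collections import Counter


def mixed_distance(x, prototype, numeric_idx, categorical_idx, gamma=1.0):
    """Squared Euclidean distance on numeric attributes plus gamma times
    the number of mismatched categorical attributes (k-prototypes style)."""
    numeric = sum((x[i] - prototype[i]) ** 2 for i in numeric_idx)
    mismatches = sum(1 for i in categorical_idx if x[i] != prototype[i])
    return numeric + gamma * mismatches


def build_prototype(points, numeric_idx, categorical_idx):
    """Summarize a cluster: mean of each numeric attribute,
    most frequent value of each categorical attribute."""
    proto = list(points[0])
    for i in numeric_idx:
        proto[i] = sum(p[i] for p in points) / len(points)
    for i in categorical_idx:
        proto[i] = Counter(p[i] for p in points).most_common(1)[0][0]
    return tuple(proto)


def label_window(window, prototypes, numeric_idx, categorical_idx,
                 radius, gamma=1.0):
    """Give each vector the index of its nearest previous-window prototype,
    or None (outlier) when even the nearest one is farther than radius."""
    labels = []
    for x in window:
        dists = [mixed_distance(x, p, numeric_idx, categorical_idx, gamma)
                 for p in prototypes]
        best = min(range(len(prototypes)), key=dists.__getitem__)
        labels.append(best if dists[best] <= radius else None)
    return labels


def is_drifting(labels, outlier_threshold=0.3):
    """Assumed drift rule: too many vectors fit no previous cluster."""
    outliers = sum(1 for label in labels if label is None)
    return outliers / len(labels) > outlier_threshold


# Hypothetical previous-window clusters over (numeric, categorical) vectors.
previous_clusters = [
    [(1.0, "x"), (2.0, "x"), (3.0, "y")],
    [(10.0, "z"), (12.0, "z")],
]
prototypes = [build_prototype(c, [0], [1]) for c in previous_clusters]

current_window = [(2.5, "x"), (11.0, "z"), (1.5, "y")]
labels = label_window(current_window, prototypes, [0], [1], radius=4.0)
if is_drifting(labels):
    print("drift detected: re-cluster the current window")
else:
    print("no drift: labels carried over from previous clustering:", labels)
```

In this sketch a window is labeled from the previous clustering only when it is judged non-drifting; a drifting window would instead be re-clustered from scratch, mirroring the division of labor between drift detection and data labeling that the summary describes.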

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
