×

Discovering rare categories from graph streams. (English) Zbl 1411.68131

Summary: Nowadays, massive graph streams are produced from various real-world applications, such as financial fraud detection, sensor networks, wireless networks. In contrast to the high volume of data, it is usually the case that only a small percentage of nodes within the time-evolving graphs might be of interest to people. Rare category detection (RCD) is an important topic in data mining, focusing on identifying the initial examples from the rare classes in imbalanced data sets. However, most existing techniques for RCD are designed for static data sets, thus not suitable for time-evolving data. In this paper, we introduce a novel setting of RCD on time-evolving graphs. To address this problem, we propose two incremental algorithms, SIRD and BIRD, which are constructed upon existing density-based techniques for RCD. These algorithms exploit the time-evolving nature of the data by dynamically updating the detection models enabling a “time-flexible” RCD. Moreover, to deal with the cases where the exact priors of the minority classes are not available, we further propose a modified version named BIRD-LI based on BIRD. Besides, we also identify a critical task in RCD named query distribution, which targets to allocate the limited budget among multiple time steps, such that the initial examples from the rare classes are detected as early as possible with the minimum labeling cost. The proposed incremental RCD algorithms and various query distribution strategies are evaluated empirically on both synthetic and real data sets.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
62-07 Data analysis (statistics) (MSC2010)
68R10 Graph theory (including graph drawing) in computer science

Software:

OddBall
PDF BibTeX XML Cite
Full Text: DOI

References:

[1] Aggarwal, CC; Philip, SY, On clustering massive text and categorical data streams, Knowl Inf Syst, 24, 171-196, (2010)
[2] Akoglu, Leman; McGlohon, Mary; Faloutsos, Christos, oddball: Spotting Anomalies in Weighted Graphs, 410-421, (2010), Berlin, Heidelberg
[3] Akoglu, Leman; Khandekar, Rohit; Kumar, Vibhore; Parthasarathy, Srinivasan; Rajan, Deepak; Wu, Kun-Lung, Fast Nearest Neighbor Search on Large Time-Evolving Graphs, 17-33, (2014), Berlin, Heidelberg
[4] Backstrom L, Huttenlocher D, Kleinberg J, Lan X (2006) Group formation in large social networks: membership, growth, and evolution. In: ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, pp 44-54
[5] Berlingerio M, Koutra D, Eliassi-Rad T, Faloutsos C (2012) Netsimile: a scalable approach to size-independent network similarity. In: arXiv:1209.2684
[6] Bettencourt LM, Hagberg AA, Larkey LB (2007) Separating the wheat from the chaff: practical anomaly detection schemes in ecological applications of distributed sensor networks. In: Distributed computing in sensor systems, Springer, New York, pp 223-239
[7] Dasgupta S, Hsu D (2008) Hierarchical sampling for active learning. In: International conference on machine learning, ACM, New York, pp 208-215
[8] Davis M, Liu W, Miller P, Redpath G (2011) Detecting anomalies in graphs with numeric labels. In: ACM international conference on information and knowledge management, ACM, New York, pp 1197-1202
[9] Eberle, W.; Graves, J.; Holder, L., Insider threat detection using a graph-based approach, J Appl Secur Res, 6, 32-81, (2010)
[10] Fan, W.; Wang, X.; Wu, Y., Incremental graph pattern matching, ACM Trans Database Syst, 38, 18, (2013) · Zbl 1321.68243
[11] Franke C, Gertz M (2008) Detection and exploration of outlier regions in sensor data streams. In: IEEE international conference on data mining workshops, IEEE, Los Alamitos, pp 375-384
[12] Gao J, Liang F, Fan W, Wang C, Sun Y, Han J (2010) On community outliers and their efficient detection in information networks. In: ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, pp 813-822
[13] Gupta, M.; Gao, J.; Aggarwal, C.; Han, J., Outlier detection for temporal data, Synth Lect Data Min Knowl Discov, 5, 1-129, (2014) · Zbl 1307.62002
[14] Gupte M, Eliassi-Rad T (2012) Measuring tie strength in implicit social networks. In: Annual ACM web science conference, ACM, New York, pp 109-118
[15] He J, Carbonell JG (2007) Nearest-neighbor-based active learning for rare category detection. In: Advances in neural information processing systems, pp 633-640
[16] He J, Liu Y, Lawrence R (2008) Graph-based rare category detection. In: IEEE international conference on data mining, IEEE, pp 833-838
[17] He J, Tong H, Carbonell J (2010) Rare category characterization. In: IEEE international conference on data mining, IEEE, pp 226-235
[18] Henderson K, Eliassi-Rad T, Faloutsos C, Akoglu L, Li L, Maruhashi K, Prakash BA, Tong H (2010) Metric forensics: a multi-level approach for mining volatile graphs. In: ACM SIGKDD international conference on knowledge discovery and data mining, ACM, New York, pp 163-172
[19] Hill DJ, Minsker BS, Amir E (2007) Real-time bayesian anomaly detection for environmental sensor data. In: Congress-international association for hydraulic research, Citeseer, vol 32, p 503
[20] Kang U, McGlohon M, Akoglu L, Faloutsos C (2010) Patterns on the connected components of terabyte-scale graphs. In: IEEE international conference on data mining, IEEE, pp 875-880
[21] Kang, U.; Tsourakakis, CE; Appel, AP; Faloutsos, C.; Leskovec, J., Hadi: mining radii of large graphs, ACM Trans Knowl Discov Data, 5, 8, (2011)
[22] Koutra, Danai; Ke, Tai-You; Kang, U.; Chau, Duen Horng; Pao, Hsing-Kuo Kenneth; Faloutsos, Christos, Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms, 245-260, (2011), Berlin, Heidelberg
[23] Koutra D, Papalexakis EE, Faloutsos C (2012) Tensorsplat: spotting latent anomalies in time. In: Panhellenic conference on informatics, IEEE, pp 144-149
[24] Kumar R, Mahdian M, McGlohon M (2010) Dynamics of conversations. In: ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 553-562
[25] Leskovec J, Kleinberg J, Faloutsos C (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp 177-187
[26] Liu, Z.; Chiew, K.; He, Q.; Huang, H.; Huang, B., Prior-free rare category detection: more effective and efficient solutions, Expert Syst Appl, 41, 7691-7706, (2014)
[27] Müller E, Sánchez PI, Mülle Y, Böhm K (2013) Ranking outlier nodes in subspaces of attributed graphs. In: IEEE international conference on data engineering workshops, IEEE, pp 216-222
[28] Pelleg D, Moore AW (2004) Active learning for anomaly and rare-category detection. In: Advances in neural information processing systems, pp 1073-1080
[29] Phua C, Lee V, Smith K, Gayler R (2010) A comprehensive survey of data mining-based fraud detection research. arXiv:hep-th/10096119
[30] Sherman, J.; Morrison, WJ, Adjustment of an inverse matrix corresponding to a change in one element of a given matrix, Annals Math Stat, 21, 124-127, (1950) · Zbl 0037.00901
[31] Sricharan K, Das K (2014) Localizing anomalous changes in time-evolving graphs. In: ACM SIGMOD international conference on management of data, ACM, pp 1347-1358
[32] Tong, Hanghang; Papadimitriou, Spiros; Yu, Philip S.; Faloutsos, Christos, Proximity Tracking on Time-Evolving Bipartite Graphs, 704-715, (2008), Philadelphia, PA
[33] Yamanishi K, Takeuchi Ji (2002) A unifying framework for detecting outliers and change points from non-stationary time series data. In: ACM SIGKDD international conference on knowledge discovery and data mining, ACM, pp 676-681
[34] Yamanishi, K.; Takeuchi, JI; Williams, G.; Milne, P., On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms, Data Min Knowl Discov, 8, 275-300, (2004)
[35] Zhou D, He J, Candan K, Davulcu H (2015a) Muvir: Multi-view rare category detection. In: International joint conference on artificial intelligence, pp 4098-4104
[36] Zhou D, Wang K, Cao N, He J (2015b) Rare category detection on time-evolving graphs. In: IEEE international conference on data mining, IEEE, pp 1135-1140
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.