
Expected similarity estimation for large-scale batch and streaming anomaly detection. (English) Zbl 1432.94044

Summary: We present a novel algorithm for anomaly detection on very large datasets and data streams. The method, named EXPected Similarity Estimation (EXPoSE), is kernel-based and able to efficiently compute the similarity between new data points and the distribution of regular data. The estimator is formulated as an inner product with a reproducing kernel Hilbert space embedding and makes no assumption about the type or shape of the underlying data distribution. We show that offline (batch) learning with EXPoSE can be done in linear time and online (incremental) learning takes constant time per instance and model update. Furthermore, EXPoSE can make predictions in constant time while requiring only constant memory. In addition, we propose different methodologies for concept drift adaptation on evolving data streams. On several real datasets we demonstrate that our approach can compete with state-of-the-art algorithms for anomaly detection while being an order of magnitude faster than most other approaches.

MSC:

94A16 Informational aspects of data analysis and big data
62R07 Statistical aspects of big data and data science
62G05 Nonparametric estimation
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T10 Pattern recognition, speech recognition
