zbMATH — the first resource for mathematics

Cluster-based outlier detection. (English) Zbl 1183.68212
Summary: Outlier detection has important applications in data mining, such as fraud detection, customer behavior analysis, and intrusion detection. It is the process of identifying data objects that are grossly different from, or inconsistent with, the rest of the data. Outliers are traditionally treated as single points; however, many abnormal events exhibit both temporal and spatial locality and may therefore form small clusters, which should also be deemed outliers. In other words, not only a single point but also a small cluster can be an outlier. In this paper, we present a new definition of outliers, the cluster-based outlier, which is meaningful and gives weight to local data behavior, and we show how to detect such outliers with the clustering algorithm LDBSCAN, which can both find clusters and assign a LOF (local outlier factor) value to single points.
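
As a rough illustration of the idea in the summary, the following Python sketch flags both isolated points and small clusters as outliers. It is not the paper's LDBSCAN algorithm: the clustering step is a plain connected-components grouping with a hypothetical radius `eps`, and the names `eps_components`, `cluster_based_outliers`, and `max_outlier_size`, as well as the toy data, are assumptions made for this example. The LOF computation follows the standard definition of Breunig et al., ignores distance ties, and assumes there are no duplicate points.

```python
from math import dist


def knn(points, i, k):
    """Return the k nearest neighbours of point i as (distance, index) pairs."""
    ds = sorted((dist(points[i], p), j) for j, p in enumerate(points) if j != i)
    return ds[:k]


def lof_scores(points, k):
    """Local outlier factor of every point (no duplicate points assumed)."""
    neigh = [knn(points, i, k) for i in range(len(points))]
    k_dist = [n[-1][0] for n in neigh]  # distance to the k-th neighbour
    lrd = []
    for n in neigh:
        # reachability distance to neighbour j: max(k-distance(j), d(i, j))
        reach = [max(k_dist[j], d) for d, j in n]
        # local reachability density: inverse of the mean reachability distance
        lrd.append(len(reach) / sum(reach))
    # LOF: mean ratio of the neighbours' density to the point's own density
    return [sum(lrd[j] for _, j in n) / (len(n) * lrd[i])
            for i, n in enumerate(neigh)]


def eps_components(points, eps):
    """Label connected components of the 'distance <= eps' graph.

    Hypothetical stand-in for a real density-based clustering step.
    """
    labels = [-1] * len(points)
    cluster = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and dist(points[i], points[j]) <= eps:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels


def cluster_based_outliers(points, eps, max_outlier_size):
    """Indices of points whose cluster has at most max_outlier_size members."""
    labels = eps_components(points, eps)
    sizes = {}
    for lab in labels:
        sizes[lab] = sizes.get(lab, 0) + 1
    return [i for i, lab in enumerate(labels) if sizes[lab] <= max_outlier_size]


# Toy data (assumed): a dense 5x5 grid, a small 3-point cluster, one lone point.
points = [(x, y) for x in range(5) for y in range(5)]
points += [(20, 20), (20.5, 20), (20, 20.5)]  # indices 25..27
points += [(-10, 10)]                         # index 28

lof = lof_scores(points, k=2)
flagged = cluster_based_outliers(points, eps=1.5, max_outlier_size=3)
print(flagged)
print([round(v, 2) for v in lof[25:]])
```

On this toy data, LOF with k = 2 stays close to 1 for the three-point cluster, because its members are each other's nearest neighbours, while the size-based rule still flags them together with the lone point. This appears to be the kind of case the cluster-based definition is aimed at.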

MSC:
68P05 Data structures
68T10 Pattern recognition, speech recognition
68P15 Database theory
Software:
LOF