×

On normalization and algorithm selection for unsupervised outlier detection. (English) Zbl 1464.62281

Summary: This paper demonstrates that the performance of various outlier detection methods is sensitive to both the characteristics of the dataset, and the data normalization scheme employed. To understand these dependencies, we formally prove that normalization affects the nearest neighbor structure, and density of the dataset; hence, affecting which observations could be considered outliers. Then, we perform an instance space analysis of combinations of normalization and detection methods. Such analysis enables the visualization of the strengths and weaknesses of these combinations. Moreover, we gain insights into which method combination might obtain the best performance for a given dataset.

MSC:

62G32 Statistics of extreme values; tail inference
62H25 Factor analysis and principal components; correspondence analysis
60F10 Large deviations
PDFBibTeX XMLCite
Full Text: DOI Link

References:

[1] Achtert E, Kriegel H-P, Zimek A (2008) Elki: a software system for evaluation of subspace clustering algorithms. In: International conference on scientific and statistical database management. Springer, pp 580-585
[2] Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: European conference on principles of data mining and knowledge discovery. Springer, pp 15-27 · Zbl 1020.68527
[3] Barnett, V.; Lewis, T., Outliers in statistical data (1974), Hoboken: Wiley, Hoboken
[4] Bates, D.; Mächler, M.; Bolker, B.; Walker, S., Fitting linear mixed-effects models using lme4, J Stat Softw, 67, 1, 1-48 (2015) · doi:10.18637/jss.v067.i01
[5] Beygelzimer A, Kakadet S, Langford J, Arya S, Mount D, Li S (2018) FNN: Fast nearest neighbor search algorithms and applications. R package version 1.1.2.2. https://CRAN.R-project.org/package=FNN
[6] Billor, N.; Hadi, As; Velleman, Pf, Bacon: blocked adaptive computationally efficient outlier nominators, Comput Stat Data Anal, 34, 3, 279-298 (2000) · Zbl 1145.62314 · doi:10.1016/S0167-9473(99)00101-2
[7] Bischl B, Mersmann O, Trautmann H, Preuß M (2012) Algorithm selection based on exploratory landscape analysis and cost-sensitive learning. In: Proceedings of the 14th annual conference on genetic and evolutionary computation. ACM, pp 313-320
[8] Brazdil, P.; Giraud-Carrier, C.; Soares, C.; Vilalta, R., Metalearning: applications to data mining (2008), Berlin: Springer, Berlin · Zbl 1173.68625
[9] Breheny, P.; Burchett, W., Visualization of regression models using visreg, R J, 9, 2, 56-71 (2017) · doi:10.32614/RJ-2017-046
[10] Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: ACM sigmod record, vol 29. ACM, pp 93-104
[11] Campos, Go; Zimek, A.; Sander, J.; Campello, Rj; Micenková, B.; Schubert, E.; Assent, I.; Houle, Me, On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study, Data Min Knowl Discov, 30, 4, 891-927 (2016) · doi:10.1007/s10618-015-0444-8
[12] Craswell, N., Precision at n, 2127-2128 (2009), Boston: Springer, Boston
[13] Csardi, G.; Nepusz, T., The igraph software package for complex network research, InterJ Complex Syst, 1695, 5, 1-9 (2006)
[14] Culberson, Jc, On the futility of blind search: an algorithmic view of “no free lunch”, Evol Comput, 6, 2, 109-127 (1998) · doi:10.1162/evco.1998.6.2.109
[15] Davis J, Goadrich M (2006) The relationship between Precision-Recall and ROC curves. In: Proceedings of the 23rd international conference on machine learning. ACM, pp 233-240
[16] Duong T (2018) ks: Kernel smoothing. R package version 1.11.3. https://CRAN.R-project.org/package=ks
[17] Emmott A, Das S, Dietterich T, Fern A, Wong W-K (2015) A meta-analysis of the anomaly detection problem. ArXiv preprint arXiv:1503.01158
[18] Emmott AF, Das S, Dietterich T, Fern A, Wong W-K (2013) Systematic construction of anomaly detection benchmarks from real data. In: Proceedings of the ACM SIGKDD workshop on outlier detection and description. ACM, pp 16-21
[19] Goix N (2016) How to evaluate the quality of unsupervised anomaly detection algorithms? arXiv preprint arXiv:1607.01152
[20] Goldstein, M.; Uchida, S., A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data, PLoS ONE, 11, 4, e0152173 (2016) · doi:10.1371/journal.pone.0152173
[21] Hahsler M, Piekenbrock M (2018) dbscan: Density based clustering of applications with noise (DBSCAN) and related algorithms. R package version 1.1-3. https://CRAN.R-project.org/package=dbscan
[22] Hautamaki V, Karkkainen I, Franti P (2004) Outlier detection using k-nearest neighbour graph. In: Proceedings of the 17th international conference on pattern recognition, ICPR 2004, vol 3. IEEE, pp 430-433
[23] Hawkins, Dm, Identification of outliers (1980), Berlin: Springer, Berlin · Zbl 0438.62022
[24] Ho, Y-C; Pepyne, Dl, Simple explanation of the no-free-lunch theorem and its implications, J Optim Theory Appl, 115, 3, 549-570 (2002) · Zbl 1031.91018 · doi:10.1023/A:1021251113462
[25] Hothorn, T.; Bretz, F.; Westfall, P., Simultaneous inference in general parametric models, Biom J, 50, 3, 346-363 (2008) · Zbl 1442.62415 · doi:10.1002/bimj.200810425
[26] Hubert, M.; Van Der Veeken, S., Outlier detection for skewed data, J Chemom, 22, 3-4, 235-246 (2008) · doi:10.1002/cem.1123
[27] Igel, C.; Toussaint, M., A no-free-lunch theorem for non-uniform distributions of target functions, J Math Modell Algorithms, 3, 4, 313-322 (2005) · Zbl 1079.90111 · doi:10.1007/s10852-005-2586-y
[28] Jin W, Tung AK, Han J, Wang W (2006) Ranking outliers using symmetric neighborhood relationship. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 577-593
[29] Kandanaarachchi S (2018) Outselect: algorithm selection for unsupervised outlier detection. R package version 0.0.0.9000. https://github.com/sevvandi/outselect
[30] Kandanaarachchi S, Munoz MA, Smith-Miles K (2019) Instance space analysis for unsupervised outlier detection. In: Proceedings of the 1st workshop on evaluation and experimental design in data mining and machine learning co-located with siam international conference on data mining (SDM 2019), Calgary, Alberta, Canada, May 4th, 2019, pp 32-41. http://ceur-ws.org/Vol-2436/article_4.pdf
[31] Kandanaarachchi S, Muñoz MA, Smith-Miles K, Hyndman R (2019) Datasets for outlier detection. https://monash.figshare.com/articles/Datasets_12338_zip/7705127/4
[32] Kang, Y.; Hyndman, R.; Smith-Miles, K., Visualising forecasting algorithm performance using time series instance spaces, Int J Forecast, 33, 2, 345-358 (2017) · doi:10.1016/j.ijforecast.2016.09.004
[33] Komsta L, Novomestky F (2015) Moments: moments, cumulants, skewness, kurtosis and related tests. R package version 0.14. https://CRAN.R-project.org/package=moments
[34] Kourentzes N (2019) tsutils: time series exploration, modelling and forecasting. R package version 0.9.0. https://CRAN.R-project.org/package=tsutils
[35] Kriegel H-P, Kröger P, Schubert E, Zimek A (2009) LoOP: local outlier probabilities. In: Proceedings of the 18th ACM conference on information and knowledge management. ACM, pp 1649-1652
[36] Kriegel H-P, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 444-452
[37] Latecki LJ, Lazarevic A, Pokrajac D (2007) Outlier detection with kernel density functions. In: International workshop on machine learning and data mining in pattern recognition. Springer, pp 61-75
[38] Leigh, C.; Alsibai, O.; Hyndman, Rj; Kandanaarachchi, S.; King, Oc; Mcgree, Jm; Neelamraju, C.; Strauss, J.; Talagala, Pd; Turner, Rd, A framework for automated anomaly detection in high frequency water-quality data from in situ sensors, Sci Total Environ, 664, 885-898 (2019) · doi:10.1016/j.scitotenv.2019.02.085
[39] Leyton-Brown K, Nudelman E, Andrew G, McFadden J, Shoham Y (2003) A portfolio approach to algorithm selection. In: 2003 International joint conference on artificial intelligence (IJCAI), vol 3. pp 1542-1543
[40] Liaw, A.; Wiener, M., Classification and regression by randomForest, R News, 2, 3, 18-22 (2002)
[41] Liu FT (2009) Isolationforest: Isolation forest. R package version 0.0-26/r4. https://R-Forge.R-project.org/projects/iforest/
[42] Liu FT, Ting KM, Zhou Z-H (2008) Isolation forest. In: 2008 Eighth IEEE international conference on data mining. IEEE, pp 413-422
[43] Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2018) Cluster: cluster analysis basics and extensions. R package version 2.0.7-1
[44] Meschiari S (2015) latex2exp: Use LaTeX expressions in plots. R package version 0.4.0. https://CRAN.R-project.org/package=latex2exp
[45] Meyer D, Dimitriadou E, Hornik K, Weingessel A, Leisch F (2018) e1071: Misc functions of the department of statistics, probability theory group (formerly: E1071), TU Wien. R package version 1.7-0. https://CRAN.R-project.org/package=e1071
[46] Meyer PE (2014) Infotheo: information-theoretic measures. R package version 1.2.0. https://CRAN.R-project.org/package=infotheo
[47] Muñoz MA (2019) Instance space analysis: a toolkit for the assessment of algorithmic power. https://github.com/andremun/InstanceSpace
[48] Muñoz, Ma; Villanova, L.; Baatar, D.; Smith-Miles, K., Instance spaces for machine learning classification, Mach Learn, 107, 1, 109-147 (2018) · Zbl 1457.68235 · doi:10.1007/s10994-017-5629-5
[49] Peng Y, Flach PA, Soares C, Brazdil P (2002) Improved dataset characterisation for meta-learning. In: International conference on discovery science. Springer, pp 141-152 · Zbl 1024.68579
[50] Pfahringer B, Bensusan H, Giraud-Carrier CG (2000) Meta-learning by landmarking various learning algorithms. In: International conference on machine learning (ICML), pp 743-750
[51] Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: ACM sigmod record, vol 29. ACM, pp 427-438
[52] Rice J (1976) The algorithm selection problem. In: Advances in computers, vol 15. Elsevier, pp 65-118
[53] Robin, X.; Turck, N.; Hainard, A.; Tiberti, N.; Lisacek, F.; Sanchez, J-C; Müller, M., pROC: an open-source package for R and S+ to analyze and compare ROC curves, BMC Bioinform, 12, 77 (2011) · doi:10.1186/1471-2105-12-77
[54] Rousseeuw, Pj; Hubert, M., Anomaly detection by robust statistics, Wiley Interdiscip Rev Data Min Knowl Discov, 8, e1236 (2017) · doi:10.1002/widm.1236
[55] Ryan JA, Ulrich JM (2018) quantmod: Quantitative financial modelling framework. R package version 0.4-13. https://CRAN.R-project.org/package=quantmod
[56] Schubert E, Zimek A, Kriegel H-P (2014a) Generalized outlier detection with flexible kernel density estimates. In: Proceedings of the 2014 SIAM international conference on data mining. SIAM, pp 542-550
[57] Schubert, E.; Zimek, A.; Kriegel, H-P, ‘Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection’, Data Min Knowl Discov, 28, 1, 190-237 (2014) · Zbl 1281.68192 · doi:10.1007/s10618-012-0300-z
[58] Smith-Miles K (2019) MATILDA: melbourne algorithm test instance library with data analytics. https://matilda.unimelb.edu.au
[59] Smith-Miles, Ka, Cross-disciplinary perspectives on meta-learning for algorithm selection, ACM Comput Surv (CSUR), 41, 1, 6 (2009)
[60] Smith-Miles, K.; Baatar, D.; Wreford, B.; Lewis, R., Towards objective measures of algorithm performance across instance space, Comput Oper Res, 45, 12-24 (2014) · Zbl 1348.90646 · doi:10.1016/j.cor.2013.11.015
[61] Smith-Miles, K.; Bowly, S., Generating new test instances by evolving in instance space, Comput Oper Res, 63, 102-113 (2015) · Zbl 1349.68325 · doi:10.1016/j.cor.2015.04.022
[62] Smith-Miles K, Tan TT (2012) Measuring algorithm footprints in instance space. In: 2012 IEEE congress on evolutionary computation. IEEE, pp 3446-3453
[63] Talagala, Pd; Hyndman, Rj; Smith-Miles, K.; Kandanaarachchi, S.; Munoz, Ma, Anomaly detection in streaming nonstationary temporal data, J Comput Graph Stat (2019) · Zbl 07499268 · doi:10.1080/10618600.2019.1617160
[64] Tang J, Chen Z, Fu AW-C, Cheung DW (2002) Enhancing effectiveness of outlier detections for low density patterns. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 535-548 · Zbl 1048.68925
[65] Wickham, H., Reshaping data with the reshape package, J Stat Softw, 21, 12, 1-20 (2007) · doi:10.18637/jss.v021.i12
[66] Wickham H (2016) ggplot2: Elegant graphics for data analysis. Springer, New York. http://ggplot2.org · Zbl 1397.62006
[67] Wilkinson, L., Visualizing big data outliers through distributed aggregation, IEEE Trans Vis Comput Graph, 24, 1, 256-266 (2018) · doi:10.1109/TVCG.2017.2744685
[68] Wolpert, Dh; Macready, Wg, No free lunch theorems for optimization, IEEE Trans Evolut Comput, 1, 1, 67-82 (1997) · doi:10.1109/4235.585893
[69] Wolpert DH, Macready WG et al (1995) No free lunch theorems for search. Technical report, SFI-TR-95-02-010, Santa Fe Institute
[70] Zhang E, Zhang Y (2009) Average precision. In: Encyclopedia of database systems. Springer, Berlin, pp 192-193
[71] Zhang K, Hutter M, Jin H (2009) A new local distance-based outlier detection approach for scattered real-world data. In: Pacific-Asia conference on knowledge discovery and data mining. Springer, pp 813-822
[72] Zimek, A.; Schubert, E.; Kriegel, H-P, A survey on unsupervised outlier detection in high-dimensional numerical data, Stat Anal Data Min ASA Data Sci J, 5, 5, 363-387 (2012) · Zbl 07260336 · doi:10.1002/sam.11161
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.