
Multivariate outlier detection based on a robust Mahalanobis distance with shrinkage estimators. (English) Zbl 1477.62133

Summary: A collection of robust Mahalanobis distances for multivariate outlier detection is proposed, based on the notion of shrinkage. Robust intensity and scaling factors are optimally estimated to define the shrinkage. Some properties are investigated, such as affine equivariance and breakdown value. The performance of the proposal is illustrated through comparison with other techniques from the literature, in a simulation study and with a real dataset. The behavior when the underlying distribution is heavy-tailed or skewed shows the appropriateness of the method when we deviate from the common assumption of normality. The resulting high true positive rates and low false positive rates in the vast majority of cases, as well as the significantly smaller computation time, show the advantages of our proposal.

MSC:

62H12 Estimation in multivariate analysis
62F35 Robustness and adaptive procedures (parametric inference)
62H05 Characterization and structure theory for multivariate probability distributions; copulas
62J07 Ridge regression; shrinkage estimators (Lasso)

Software:

AS 78; LIBRA; MNM

References:

[1] Agostinelli, C.; Romanazzi, M., Local depth, J Stat Plan Inference, 141, 2, 817-830 (2011) · Zbl 1353.62019
[2] Alqallaf, F.; Van Aelst, S.; Yohai, VJ; Zamar, RH, Propagation of outliers in multivariate data, Ann Stat, 37, 1, 311-331 (2009) · Zbl 1155.62043
[3] Bay SD (1999) The UCI KDD archive [http://kdd.ics.uci.edu]. University of California, Irvine. Department of Information and Computer Science, vol 404, p 405
[4] Becker, C.; Gather, U., The masking breakdown point of multivariate outlier identification rules, J Am Stat Assoc, 94, 447, 947-955 (1999) · Zbl 1072.62600
[5] Becker, C.; Fried, R.; Kuhnt, S., Robustness and complex data structures: festschrift in honour of Ursula Gather (2014), New York: Springer · Zbl 1290.62004
[6] Bose, A., Estimating the asymptotic dispersion of the l1 median, Ann Inst Stat Math, 47, 2, 267-271 (1995) · Zbl 0833.62025
[7] Bose, A.; Chaudhuri, P., On the dispersion of multivariate median, Ann Inst Stat Math, 45, 3, 541-550 (1993) · Zbl 0799.62061
[8] Brettschneider, J.; Collin, F.; Bolstad, BM; Speed, TP, Quality assessment for short oligonucleotide microarray data, Technometrics, 50, 3, 241-264 (2008)
[9] Brown, B., Statistical uses of the spatial median, J R Stat Soc Ser B (Methodol), 45, 25-30 (1983) · Zbl 0508.62046
[10] Cerioli, A.; Riani, M.; Atkinson, AC; Perrotta, D.; Torti, F., Fitting mixtures of regression lines with the forward search, Min Massive Data Sets Secur, 19, 271 (2008)
[11] Cerioli, A.; Riani, M.; Atkinson, AC, Controlling the size of multivariate outlier tests with the MCD estimator of scatter, Stat Comput, 19, 3, 341-353 (2009)
[12] Chen, SX; Qin, Y-L, A two-sample test for high-dimensional data with applications to gene-set testing, Ann Stat, 38, 2, 808-835 (2010) · Zbl 1183.62095
[13] Chen, Y.; Dang, X.; Peng, H.; Bart, HL, Outlier detection with the kernelized spatial depth function, IEEE Trans Pattern Anal Mach Intell, 31, 2, 288-305 (2009)
[14] Chen, Y.; Wiesel, A.; Hero, AO, Robust shrinkage estimation of high-dimensional covariance matrices, IEEE Trans Signal Process, 59, 9, 4097-4107 (2011) · Zbl 1391.62088
[15] Choi, HC; Edwards, HP; Sweatman, CH; Obolonkin, V., Multivariate outlier detection of dairy herd testing data, ANZIAM J, 57, 38-53 (2016)
[16] Chu, JT, On the distribution of the sample median, Ann Math Stat, 26, 112-116 (1955) · Zbl 0064.13102
[17] Couillet, R.; McKay, M., Large dimensional analysis and optimization of robust shrinkage covariance matrix estimators, J Multivar Anal, 131, 99-120 (2014) · Zbl 1306.62119
[18] DeMiguel, V.; Martin-Utrera, A.; Nogales, FJ, Size matters: optimal calibration of shrinkage estimators for portfolio selection, J Bank Finance, 37, 8, 3018-3034 (2013)
[19] Devlin, SJ; Gnanadesikan, R.; Kettenring, JR, Robust estimation of dispersion matrices and principal components, J Am Stat Assoc, 76, 374, 354-362 (1981) · Zbl 0463.62031
[20] Dodge, Y., An introduction to l1-norm based statistical data analysis, Comput Stat Data Anal, 5, 4, 239-253 (1987)
[21] Donoho, DL; Huber, PJ; Bickel, PJ; Doksum, KA; Hodges, JL Jr, The notion of breakdown point, A festschrift for Erich L. Lehmann, 157-184 (1983), Belmont: Wadsworth · Zbl 0523.62032
[22] Falk, M., On mad and comedians, Ann Inst Stat Math, 49, 4, 615-644 (1997) · Zbl 0897.62029
[23] Filzmoser, P.; Garrett, RG; Reimann, C., Multivariate outlier detection in exploration geochemistry, Comput Geosci, 31, 5, 579-587 (2005)
[24] Gao X (2016) A flexible shrinkage operator for fussy grouped variable selection. Statistical Papers, pp 1-24
[25] Gnanadesikan, R.; Kettenring, JR, Robust estimates, residuals, and outlier detection with multiresponse data, Biometrics, 28, 81-124 (1972)
[26] Goutte C, Gaussier E (2005) A probabilistic interpretation of precision, recall and f-score, with implication for evaluation. In: Proceedings of the European Conference on Information Retrieval, pp 345-359. Springer
[27] Gower, J., Algorithm as 78: the mediancentre, J R Stat Soc Ser C (Appl Stat), 23, 3, 466-470 (1974)
[28] Hall, P.; Welsh, A., Limit theorems for the median deviation, Ann Inst Stat Math, 37, 1, 27-36 (1985) · Zbl 0591.62028
[29] Hardin, J.; Rocke, DM, The distribution of robust distances, J Comput Graph Stat, 14, 4, 928-946 (2005)
[30] Hubert, M.; Debruyne, M., Breakdown value, Wiley Interdiscip Rev Comput Stat, 1, 3, 296-302 (2009)
[31] Hubert, M.; Debruyne, M., Minimum Covariance Determinant, Wiley Interdiscip Rev Comput Stat, 2, 1, 36-43 (2010)
[32] Hubert, M.; Rousseeuw, PJ; Van Aelst, S., High-breakdown robust multivariate methods, Stat Sci, 23, 92-119 (2008) · Zbl 1327.62328
[33] Inselberg, A., Parallel coordinates (2009), New York: Springer · Zbl 1183.68662
[34] Inselberg A, Dimsdale B (1990) Parallel coordinates: a tool for visualizing multi-dimensional geometry. In: Proceedings of the 1st conference on Visualization’90, pp 361-378. IEEE Computer Society Press
[35] James W, Stein C (1961) Estimation with quadratic loss. In: Proceedings of the fourth Berkeley symposium on mathematical statistics and probability, vol 1, pp 361-379 · Zbl 1281.62026
[36] Lazar, N., The statistical analysis of functional MRI data (2008), New York: Springer · Zbl 1312.62004
[37] Ledoit O, Wolf M (2003a) Honey, I shrunk the sample covariance matrix. UPF economics and business working paper (691)
[38] Ledoit, O.; Wolf, M., Improved estimation of the covariance matrix of stock returns with an application to portfolio selection, J Empir Finance, 10, 5, 603-621 (2003)
[39] Ledoit, O.; Wolf, M., A well-conditioned estimator for large-dimensional covariance matrices, J Multivar Anal, 88, 2, 365-411 (2004) · Zbl 1032.62050
[40] Leroy AM, Rousseeuw PJ (1987) Robust regression and outlier detection · Zbl 0711.62030
[41] Lindquist, MA, The statistical analysis of FMRI data, Stat Sci, 23, 439-464 (2008) · Zbl 1329.62296
[42] Liu, RY, On a notion of data depth based on random simplices, Ann Stat, 18, 1, 405-414 (1990) · Zbl 0701.62063
[43] Lopuhaa, HP; Rousseeuw, PJ, Breakdown points of affine equivariant estimators of multivariate location and covariance matrices, Ann Stat, 19, 229-248 (1991) · Zbl 0733.62058
[44] Mahalanobis, PC, On the generalized distance in statistics, Proc Natl Inst Sci (Calcutta), 2, 49-55 (1936) · Zbl 0015.03302
[45] Marcano, L.; Fermín, W., Comparación de métodos de detección de datos anómalos multivariantes mediante un estudio de simulación, SABER. Revista Multidisciplinaria del Consejo de Investigación de la Universidad de Oriente, 25, 2, 192-201 (2013)
[46] Maronna RA, Yohai VJ (1976) Robust estimation of multivariate location and scatter. Statistics Reference Online, Wiley StatsRef · Zbl 1466.62158
[47] Maronna, RA; Zamar, RH, Robust estimates of location and dispersion for high-dimensional datasets, Technometrics, 44, 4, 307-317 (2002)
[48] Monti, MM, Statistical analysis of fmri time-series: a critical review of the glm approach, Front Hum Neurosci, 5, 28 (2011)
[49] Möttönen J, Nordhausen K, Oja H et al (2010) Asymptotic theory of the spatial median. In: Nonparametrics and Robustness in Modern Statistical Inference and Time Series Analysis: A Festschrift in honor of Professor Jana Jurečková, pp 182-193. Institute of Mathematical Statistics
[50] Oja, H., Multivariate nonparametric methods with R: an approach based on spatial signs and ranks (2010), New York: Springer · Zbl 1269.62036
[51] Paindaveine, D.; Van Bever, G., From depth to local depth: a focus on centrality, J Am Stat Assoc, 108, 503, 1105-1119 (2013) · Zbl 06224990
[52] Peña, D.; Prieto, FJ, Multivariate outlier detection and robust covariance matrix estimation, Technometrics, 43, 3, 286-310 (2001)
[53] Peña, D.; Prieto, FJ, Combining random and specific directions for outlier detection and robust estimation in high-dimensional multivariate data, J Comput Graph Stat, 16, 1, 228-254 (2007)
[54] Perrotta D, Torti F (2010) Detecting price outliers in european trade data with the forward search. In: Data Analysis and Classification, pp 415-423. Springer
[55] Poline, J-B; Brett, M., The general linear model and fmri: does love last forever?, Neuroimage, 62, 2, 871-880 (2012)
[56] Powers DM (2011) Evaluation: from precision, recall and f-measure to roc, informedness, markedness and correlation
[57] Reimann, C.; Filzmoser, P., Normal and lognormal data distribution in geochemistry: death of a myth. Consequences for the statistical treatment of geochemical and environmental data, Environ Geol, 39, 9, 1001-1014 (2000)
[58] Rousseeuw, PJ, Multivariate estimation with high breakdown point, Math Stat Appl, 8, 283-297 (1985) · Zbl 0609.62054
[59] Rousseeuw, PJ; Driessen, KV, A fast algorithm for the minimum covariance determinant estimator, Technometrics, 41, 3, 212-223 (1999)
[60] Rousseeuw, PJ; Van Zomeren, BC, Unmasking multivariate outliers and leverage points, J Am Stat Assoc, 85, 411, 633-639 (1990)
[61] Sajesh, T.; Srinivasan, M., Outlier detection for high dimensional data using the comedian approach, J Stat Comput Simul, 82, 5, 745-757 (2012) · Zbl 1432.62164
[62] Serfling R (2002) A depth function and a scale curve based on spatial quantiles. In: Statistical data analysis based on the L1-norm and related methods, pp 25-38. Springer, New York · Zbl 1460.62076
[63] Small, CG, A survey of multidimensional medians, Int Stat Rev, 58, 263-277 (1990)
[64] Sokolova M, Japkowicz N, Szpakowicz S (2006) Beyond accuracy, f-score and roc: a family of discriminant measures for performance evaluation. In: Australasian Joint Conference on Artificial Intelligence, pp 1015-1021. Springer, New York
[65] Steland A (2018) Shrinkage for covariance estimation: asymptotics, confidence intervals, bounds and applications in sensor monitoring and finance. Statistical Papers, pp 1-22 · Zbl 1408.62178
[66] Sun R, Ma T, Liu S (2018) Portfolio selection: shrinking the time-varying inverse conditional covariance matrix. Statistical Papers, pp 1-22
[67] Sun, Y.; Genton, MG, Functional boxplots, J Comput Graph Stat, 20, 2, 316-334 (2011)
[68] Tarr, G.; Müller, S.; Weber, NC, Robust estimation of precision matrices under cellwise contamination, Comput Stat Data Anal, 93, 404-420 (2016) · Zbl 1468.62192
[69] Templ, M.; Filzmoser, P.; Reimann, C., Cluster analysis applied to regional geochemical data: problems and possibilities, Appl Geochem, 23, 8, 2198-2213 (2008)
[70] Tukey, JW, Mathematics and the picturing of data, Proc Int Congr Math, 2, 523-531 (1975) · Zbl 0347.62002
[71] Vardi, Y.; Zhang, C-H, The multivariate l1-median and associated data depth, Proc Natl Acad Sci USA, 97, 4, 1423-1426 (2000) · Zbl 1054.62067
[72] Vargas N, JA, Robust estimation in multivariate control charts for individual observations, J Qual Technol, 35, 4, 367-376 (2003)
[73] Verboven, S.; Hubert, M., Libra: a matlab library for robust analysis, Chemometr Intell Lab Syst, 75, 2, 127-136 (2005)
[74] Wegman, EJ, Hyperdimensional data analysis using parallel coordinates, J Am Stat Assoc, 85, 411, 664-675 (1990)
[75] Zeng, Y.; Wang, G.; Yang, E.; Ji, G.; Brinkmeyer-Langford, CL; Cai, JJ, Aberrant gene expression in humans, PLoS Genet, 11, 1, e1004942 (2015)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.