Robust clustering around regression lines with high density regions. (English) Zbl 1474.62217

Summary: Robust methods are needed to fit regression lines when outliers are present. In a clustering framework, outliers can be extreme observations, high leverage points, but also data points which lie among the groups. Outliers are also of paramount importance in the analysis of international trade data, which motivate our work, because they may provide information about anomalies like fraudulent transactions. In this paper we show that robust techniques can fail when a large proportion of non-contaminated observations fall in a small region, which is a likely occurrence in many international trade data sets. In such instances, the effect of a high-density region is so strong that it can override the benefits of trimming and other robust devices. We propose to solve the problem by sampling a much smaller subset of observations which preserves the cluster structure and retains the main outliers of the original data set. This goal is achieved by defining the retention probability of each point as an inverse function of the estimated density function for the whole data set. We motivate our proposal as a thinning operation on a point pattern generated by different components. We then apply robust clustering methods to the thinned data set for the purposes of classification and outlier detection. We show the advantages of our method both in empirical applications to international trade examples and through a simulation study.
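The thinning step described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not state the exact inverse function, the density estimator, or the bandwidth. The rule used here (retention probability equal to the minimum estimated density divided by the point's own estimated density), the Gaussian kernel, the bandwidth value, and all function names are assumptions chosen for illustration. The data are synthetic.

```python
import math
import random


def kernel_density(points, bandwidth):
    """Gaussian kernel density estimate evaluated at each 2-D data point."""
    n = len(points)
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth ** 2)
    dens = []
    for xi, yi in points:
        total = 0.0
        for xj, yj in points:
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            total += math.exp(-0.5 * d2 / bandwidth ** 2)
        dens.append(norm * total)
    return dens


def retention_probabilities(dens):
    """Assumed inverse rule: p_i = min(density) / density_i, so 0 < p_i <= 1."""
    d_min = min(dens)
    return [d_min / d for d in dens]


def thin(points, probs, rng):
    """Keep each point independently with its own retention probability."""
    return [pt for pt, pr in zip(points, probs) if rng.random() < pr]


rng = random.Random(0)

# Synthetic data (illustrative only): a dense blob near the origin
# plus a sparse group scattered around the line y = 2x.
dense = [(rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)) for _ in range(300)]
xs = [rng.uniform(1.0, 10.0) for _ in range(60)]
sparse = [(x, 2.0 * x + rng.gauss(0.0, 0.1)) for x in xs]
points = dense + sparse

# Bandwidth 0.5 is an arbitrary choice for this sketch.
probs = retention_probabilities(kernel_density(points, bandwidth=0.5))
thinned = thin(points, probs, rng)

mean_dense = sum(probs[:300]) / 300
mean_sparse = sum(probs[300:]) / 60
print(f"kept {len(thinned)} of {len(points)} points")
print(f"mean retention: dense={mean_dense:.4f}, sparse={mean_sparse:.4f}")
```

Under this rule, points in the high-density region are retained with much lower probability than points in sparse regions, so isolated observations tend to survive the thinning. A robust clustering method would then be applied to `thinned`; that step is not shown here.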

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62H11 Directional data; spatial statistics
62G35 Nonparametric robustness
62G07 Density estimation
62G08 Nonparametric regression and quantile regression
62P20 Applications of statistics to economics
91B60 Trade models

Software:

spatstat; TCLUST

References:

[1] Atkinson AC, Riani M, Cerioli A (2004) Exploring multivariate data with the forward search. Springer, New York · Zbl 1049.62057
[2] Atkinson AC, Riani M, Cerioli A (2010) The forward search: theory and data analysis. J Korean Stat Soc 39:117-134 · Zbl 1294.62149
[3] Baddeley A, Turner R (2012) Package ‘spatstat’: spatial point pattern analysis, model-fitting, simulation, tests. http://www.cran.r-project.org/web/packages/spatstat/spatstat.pdf
[4] Bai X, Yao W, Boyer JE (2012) Robust fitting of mixture regression models. Comput Stat Data Anal 56:2347-2359 · Zbl 1252.62011
[5] Byers S, Raftery AE (1998) Nearest-neighbor clutter removal for estimating features in spatial point processes. J Am Stat Assoc 93:577-584 · Zbl 0926.62089
[6] Coretto P, Hennig C (2010) A simulation study to compare robust clustering methods based on mixtures. Adv Data Anal Classif 4:111-135 · Zbl 1284.62366
[7] Dasgupta A, Raftery AE (1998) Detecting features in spatial point processes with clutter via model-based clustering. J Am Stat Assoc 93:294-302 · Zbl 0906.62105
[8] De Battisti F, Salini S (2013) Robust analysis of bibliometric data. Stat Methods Appl 22:269-283 · Zbl 1333.62012
[9] Diggle PJ (1985) A kernel method for smoothing point process data. Appl Stat 34:138-147 · Zbl 0584.62140
[10] FATF-OECD, Financial Action Task Force (2006) Trade based money laundering. http://www.fatf-gafi.org/
[11] FATF-OECD, Financial Action Task Force (2008) Best practices on trade based money laundering. http://www.fatf-gafi.org/
[12] Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97:611-631 · Zbl 1073.62545
[13] Fritz H, García-Escudero LA, Mayo-Iscar A (2012) tclust: an R package for a trimming approach to Cluster Analysis. J Stat Softw 47.
[14] García-Escudero LA, Gordaliza A, Van Aelst S, Zamar R (2009) Robust linear clustering. J R Stat Soc B 71:301-319 · Zbl 1231.62112
[15] García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A (2010a) A review of robust clustering methods. Adv Data Anal Classif 4:89-109 · Zbl 1284.62375
[16] García-Escudero LA, Gordaliza A, Mayo-Iscar A (2010b) Robust clusterwise linear regression through trimming. Comput Stat Data Anal 54:3057-3069 · Zbl 1284.62198
[17] Heikkonen J, Perrotta D, Riani M, Torti F (2013) Issues on clustering and data gridding. In: Giusti A, Ritter G, Vichi M (eds), Berlin, pp 37-44
[18] Illian J, Penttinen A, Stoyan H, Stoyan D (2008) Statistical analysis and modelling of spatial point patterns. Wiley, Chichester · Zbl 1197.62135
[19] Neykov N, Filzmoser P, Dimova R, Neytchev P (2007) Robust fitting of mixtures using the trimmed likelihood estimator. Comput Stat Data Anal 52:299-308 · Zbl 1328.62033
[20] Riani M, Atkinson AC, Cerioli A (2012) Problems and challenges in the analysis of complex data: static and dynamic approaches. In: Ciaccio A et al (eds), Berlin, pp 145-157
[21] Riani M, Cerioli A, Atkinson AC, Perrotta D, Torti F (2008) Fitting mixtures of regression lines with the forward search. In: Fogelman-Soulié F et al (eds), Amsterdam, pp 271-286
[22] Rocci R, Gattone SA, Vichi M (2009) A new dimension reduction method: factor discriminant K-means. J Classif 28:210-226 · Zbl 1226.62062
[23] Van Aelst S, Wang X, Zamar R, Zhu R (2006) Linear grouping using orthogonal regression. Comput Stat Data Anal 50:1287-1312 · Zbl 1431.62273
[24] Vichi M, Rocci R, Kiers HAL (2007) Simultaneous component and clustering models for three-way data: within and between approaches. J Classif 24:71-98 · Zbl 1144.62045
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.