The forward search: theory and data analysis. (English) Zbl 1294.62149

Summary: The forward search is a powerful general method, incorporating flexible data-driven trimming, for the detection of outliers and unsuspected structure in data and so for building robust models. Starting from small subsets of data, observations that are close to the fitted model are added to the observations used in parameter estimation. As this subset grows we monitor parameter estimates, test statistics and measures of fit such as residuals. The paper surveys theoretical development in work on the Forward Search over the last decade. The main illustration is a regression example with 330 observations and 9 potential explanatory variables. Mention is also made of procedures for multivariate data, including clustering, time series analysis and fraud detection.
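The subset-growing procedure described in the summary can be sketched for simple linear regression. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fit_line`, `squared_residuals`, `forward_search`), the exhaustive least-median-of-squares choice of the starting pair, and the toy data are all hypothetical, and the monitored quantity is the raw minimum squared residual outside the subset rather than the scaled deletion residual used in the forward search literature. The sketch also assumes distinct x values.

```python
from itertools import combinations
from statistics import median


def fit_line(points):
    """Ordinary least squares fit of y = a + b*x to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx  # assumes the x values in the subset are not all equal
    return mean_y - slope * mean_x, slope


def squared_residuals(data, intercept, slope):
    """Squared residuals of every observation from the given line."""
    return [(y - intercept - slope * x) ** 2 for x, y in data]


def forward_search(data):
    """Grow the fitting subset one observation at a time, recording each step."""
    n = len(data)
    # Assumed starting rule: the pair of observations whose exact-fit line
    # has the smallest median squared residual over all the data.
    start = min(
        combinations(range(n), 2),
        key=lambda pair: median(
            squared_residuals(data, *fit_line([data[i] for i in pair]))
        ),
    )
    subset = sorted(start)
    history = []
    while True:
        intercept, slope = fit_line([data[i] for i in subset])
        res2 = squared_residuals(data, intercept, slope)
        outside = [res2[i] for i in range(n) if i not in subset]
        history.append({
            "m": len(subset),
            "subset": subset,
            "intercept": intercept,
            "slope": slope,
            # Simplified monitoring statistic (None once all data are in).
            "min_outside": min(outside) if outside else None,
        })
        if len(subset) == n:
            return history
        # Next subset: the m + 1 observations closest to the current fit.
        order = sorted(range(n), key=lambda i: res2[i])
        subset = sorted(order[: len(subset) + 1])


# Toy data: ten points near y = 2 + 3x, followed by two gross outliers.
data = [(x, 2 + 3 * x + (0.1 if x % 2 == 0 else -0.1)) for x in range(10)]
data += [(10, 60.0), (11, -5.0)]

for step in forward_search(data):
    print(step["m"], round(step["slope"], 3), step["min_outside"])
```

In this toy run the two outlying observations enter only in the last two steps, and the monitored statistic jumps sharply just before the first of them is added; a jump of this kind is what the monitoring plots are meant to reveal.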

MSC:

62J05 Linear regression; mixed models
62H12 Estimation in multivariate analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62M10 Time series, auto-correlation, regression, etc. in statistics (GARCH)
62P12 Applications of statistics to environmental and related topics
62-02 Research exposition (monographs, survey articles) pertaining to statistics
62-07 Data analysis (statistics) (MSC2010)

References:

[1] Andrews, D. F.; Bickel, P. J.; Hampel, F. R.; Tukey, J. W.; Huber, P. J., Robust estimates of location: survey and advances (1972), Princeton University Press: Princeton University Press Princeton, NJ · Zbl 0254.62001
[2] Atkinson, A. C., Testing transformations to normality, Journal of the Royal Statistical Society, Series B, 35, 473-479 (1973) · Zbl 0289.62047
[3] Atkinson, A. C., Plots, transformations, and regression (1985), Oxford University Press: Oxford University Press Oxford · Zbl 0582.62065
[4] Atkinson, A. C., Fast very robust methods for the detection of multiple outliers, Journal of the American Statistical Association, 89, 1329-1339 (1994) · Zbl 0825.62429
[5] Atkinson, A. C., Econometric applications of the forward search in regression: robustness, diagnostics and graphics, Econometric Reviews, 28, 21-39 (2009) · Zbl 1161.62446
[6] Atkinson, A. C.; Riani, M., Robust diagnostic regression analysis (2000), Springer-Verlag: Springer-Verlag New York · Zbl 0964.62063
[7] Atkinson, A. C.; Riani, M., Tests in the fan plot for robust, diagnostic transformations in regression, Chemometrics and Intelligent Laboratory Systems, 60, 87-100 (2002)
[8] Atkinson, A. C.; Riani, M., Distribution theory and simulations for tests of outliers in regression, Journal of Computational and Graphical Statistics, 15, 460-476 (2006)
[9] Atkinson, A. C.; Riani, M., Building regression models with the forward search, Journal of Computing and Information Technology—CIT, 15, 287-294 (2007)
[10] Atkinson, A. C.; Riani, M., Exploratory tools for clustering multivariate data, Computational Statistics and Data Analysis, 52, 272-285 (2007) · Zbl 1452.62028
[11] Atkinson, A. C.; Riani, M., A robust and diagnostic information criterion for selecting regression models, Journal of the Japanese Statistical Society, 38, 3-14 (2008)
[12] Atkinson, A. C.; Riani, M.; Cerioli, A., Exploring multivariate data with the forward search (2004), Springer-Verlag: Springer-Verlag New York · Zbl 1049.62057
[13] Atkinson, A. C.; Riani, M.; Cerioli, A., Random start forward searches with envelopes for detecting clusters in multivariate data, (Zani, S.; Cerioli, A.; Riani, M.; Vichi, M., Data analysis, classification and the forward search (2006), Springer-Verlag: Springer-Verlag Berlin), 163-171
[14] Atkinson, A. C.; Riani, M.; Cerioli, A., Monitoring random start forward searches for multivariate data, (Brito, P., COMPSTAT 2008 (2008), Physica-Verlag: Physica-Verlag Heidelberg), 447-458 · Zbl 1153.65307
[15] Barnett, V.; Lewis, T., Outliers in statistical data (1994), Wiley: Wiley New York · Zbl 0801.62001
[16] Becker, C.; Gather, U., The masking breakdown point of multivariate outlier identification rules, Journal of the American Statistical Association, 94, 947-955 (1999) · Zbl 1072.62600
[17] Beckman, R. J.; Cook, R. D., Outlier..........s (with discussion), Technometrics, 25, 119-163 (1983) · Zbl 0514.62041
[18] Box, G. E.P., Non-normality and tests on variances, Biometrika, 40, 318-335 (1953) · Zbl 0051.10805
[19] Box, G. E.P.; Cox, D. R., An analysis of transformations (with discussion), Journal of the Royal Statistical Society, Series B, 26, 211-246 (1964) · Zbl 0156.40104
[20] Box, G. E.P.; Watson, G. S., Robustness to non-normality of regression tests, Biometrika, 49, 93-106 (1962) · Zbl 0113.34901
[21] Breiman, L.; Friedman, J. H., Estimating optimal transformations for multiple regression and correlation (with discussion), Journal of the American Statistical Association, 80, 580-619 (1985) · Zbl 0594.62044
[22] Casella, G.; Berger, R. L., Statistical inference (2002), Duxbury: Duxbury Pacific Grove
[25] Cerioli, A.; Riani, M., Robust methods for the analysis of spatially autocorrelated data, Statistical Methods and Applications—Journal of the Italian Statistical Society, 11, 335-358 (2002) · Zbl 1145.62375
[26] Cerioli, A.; Riani, M.; Atkinson, A. C., Controlling the size of multivariate outlier tests with the MCD estimator of scatter, Statistics and Computing, 19, 341-353 (2009)
[27] Chen, C.; Liu, L.-M., Joint estimation of model parameters and outlier effects in time series, Journal of the American Statistical Association, 88, 284-297 (1993) · Zbl 0775.62229
[28] Cheng, T.-C.; Biswas, A., Maximum trimmed likelihood estimator for multivariate mixed continuous and categorical data, Computational Statistics and Data Analysis, 52, 2042-2065 (2008) · Zbl 1452.62368
[29] Cook, R. D.; Weisberg, S., Residuals and influence in regression (1982), Chapman and Hall: Chapman and Hall London · Zbl 0564.62054
[30] Crosilla, F.; Visentini, D. F.S., An automatic classification and robust segmentation procedure of spatial objects, Statistical Methods and Applications, 15, 329-341 (2007)
[31] de Jong, P.; Penzer, J., Diagnosing shocks in time series, Journal of the American Statistical Association, 93, 796-806 (1998) · Zbl 0926.62079
[32] Deng, D.; Joseph, V. R.; Sudjianto, A.; Wu, C. F.J., Active learning through sequential design, with applications to detection of money laundering, Journal of the American Statistical Association, 104, 969-981 (2009) · Zbl 1388.62234
[33] Forbes, J. D., Further experiments and remarks on the measurement of heights by the boiling point of water, Transactions of the Royal Society of Edinburgh, 21, 235-243 (1857)
[34] Fraley, C.; Raftery, A. E., Enhanced model-based clustering, density estimation and discriminant analysis: MCLUST, Journal of Classification, 20, 263-286 (2003) · Zbl 1055.62071
[35] García-Escudero, L. A.; Gordaliza, A., Generalized radius processes for elliptically contoured distributions, Journal of the American Statistical Association, 100, 1036-1045 (2005) · Zbl 1117.62339
[36] García-Escudero, L. A.; Gordaliza, A.; San Martin, R.; Van Aelst, S.; Zamar, R., Robust linear clustering, Journal of the Royal Statistical Society, Series B, 71, 301-318 (2009) · Zbl 1231.62112
[37] Gilmour, S. G., The interpretation of Mallows’s \(C_p\)-statistic, The Statistician, 45, 49-56 (1996)
[38] Grossi, L.; Laurini, F., A robust forward weighted Lagrange multiplier test for conditional heteroscedasticity, Computational Statistics and Data Analysis, 53, 2251-2263 (2009) · Zbl 1453.62100
[39] Guenther, W. C., An easy method for obtaining percentage points of order statistics, Technometrics, 19, 319-321 (1977) · Zbl 0371.62069
[40] Hadi, A. S., Identifying multiple outliers in multivariate data, Journal of the Royal Statistical Society, Series B, 54, 761-771 (1992)
[41] Hadi, A. S., A modification of a method for the detection of outliers in multivariate samples, Journal of the Royal Statistical Society, Series B, 56, 393-396 (1994) · Zbl 0800.62347
[42] Hadi, A. S.; Imon, A. H.M. R.; Werner, M., Detection of outliers, Wiley Interdisciplinary Reviews: Computational Statistics, 1, 57-70 (2009)
[43] Hadi, A. S.; Simonoff, J. S., Procedures for the identification of multiple outliers in linear models, Journal of the American Statistical Association, 88, 1264-1272 (1993)
[44] Heagerty, P.; Lumley, T., Window subsampling of estimating functions with application to regression models, Journal of the American Statistical Association, 95, 197-211 (2000) · Zbl 1013.62077
[45] Hampel, F. R., Beyond location parameters: Robust concepts and methods, Bulletin of the International Statistical Institute, 46, 375-382 (1975) · Zbl 0349.62029
[46] Hampel, F.; Ronchetti, E. M.; Rousseeuw, P.; Stahel, W. A., Robust statistics (1986), Wiley: Wiley New York
[47] Hardin, J.; Rocke, D. M., The distribution of robust distances, Journal of Computational and Graphical Statistics, 14, 910-927 (2005)
[48] Harvey, A. C.; Koopman, S. J., Diagnostic checking of unobserved components time series models, Journal of Business and Economic Statistics, 10, 377-389 (1992)
[49] Hastie, T.; Tibshirani, R.; Friedman, J., The elements of statistical learning. Data mining, inference and prediction (2009), Springer: Springer New York · Zbl 1273.62005
[50] Hawkins, D. M., Identification of outliers (1980), Chapman and Hall: Chapman and Hall London · Zbl 0438.62022
[51] Hawkins, D. M., Discussion of paper by Beckman and Cook, Technometrics, 25, 155-156 (1983)
[52] Huber, P. J., Robust statistics (1981), Wiley: Wiley New York · Zbl 0536.62025
[53] Huber, P. J.; Ronchetti, E. M., Robust statistics (2009), Wiley: Wiley New York · Zbl 1276.62022
[54] Hubert, M.; Rousseeuw, P. J.; Van Aelst, S., High-breakdown robust multivariate methods, Statistical Science, 23, 92-119 (2008) · Zbl 1327.62328
[55] Johnson, N. L.; Kotz, S.; Balakrishnan, N., Continuous univariate distributions—1 (1994), Wiley: Wiley New York · Zbl 0811.62001
[56] Mallows, C. L., Some comments on \(C_p\), Technometrics, 15, 661-675 (1973) · Zbl 0269.62061
[57] Maronna, R. A.; Martin, R. D.; Yohai, V. J., Robust statistics: Theory and methods (2006), Wiley: Wiley Chichester · Zbl 1094.62040
[58] Mavridis, D.; Moustaki, I., The forward search algorithm for detecting aberrant response patterns in factor analysis for binary data, Journal of Computational and Graphical Statistics, 18, 1016-1034 (2010)
[59] Morgenthaler, S., A survey of robust statistics, Statistical Methods and Applications, 15, 271-293 (2007), Statistical Methods and Applications, 16, 171-172 (erratum) · Zbl 1181.62029
[60] Müller, C.; Neykov, N., Breakdown points of the trimmed likelihood and related estimators in GLMs, Journal of Statistical Planning and Inference, 116, 503-519 (2003) · Zbl 1178.62074
[61] Perrotta, D.; Riani, M.; Torti, F., New robust dynamic plots for regression mixture detection, Advances in Data Analysis and Classification, 3, 263-279 (2009) · Zbl 1306.62079
[62] Proietti, T.; Riani, M., Seasonal adjustment and transformations, Journal of Time Series Analysis, 30, 47-69 (2009) · Zbl 1223.62154
[63] Riani, M., Extensions of the forward search to time series, Studies in Nonlinear Dynamics and Econometrics, 8, 1-23 (2004) · Zbl 1081.91592
[64] Riani, M., Robust transformations in univariate and multivariate time series, Econometric Reviews, 28, 262-278 (2009) · Zbl 1156.62057
[65] Riani, M.; Atkinson, A. C., Fast calibrations of the forward search for testing multiple outliers in regression, Advances in Data Analysis and Classification, 1, 123-141 (2007) · Zbl 1301.62069
[68] Riani, M.; Atkinson, A. C.; Cerioli, A., Finding an unknown number of multivariate outliers, Journal of the Royal Statistical Society, Series B, 71, 447-466 (2009) · Zbl 1248.62091
[69] Riani, M.; Cerioli, A.; Atkinson, A.; Perrotta, D.; Torti, F., Fitting mixtures of regression lines with the forward search, (Fogelman-Soulié, F.; Perrotta, D.; Piskorski, J.; Steinberger, R., Mining massive data sets for security (2008), IOS Press: IOS Press Amsterdam), 271-286
[70] Ronchetti, E.; Staudte, R. G., A robust version of Mallows’s \(C_p\), Journal of the American Statistical Association, 89, 550-559 (1994) · Zbl 0803.62026
[71] Rousseeuw, P. J., Least median of squares regression, Journal of the American Statistical Association, 79, 871-880 (1984) · Zbl 0547.62046
[72] Rousseeuw, P. J.; Leroy, A. M., Robust regression and outlier detection (1987), Wiley: Wiley New York · Zbl 0711.62030
[73] Solaro, N.; Pagani, M., The forward search for classical multidimensional scaling when the starting data matrix is known, (Lauro, C.; Palumbo, F.; Greenacre, M., Data analysis and classification: From the exploratory to the confirmatory approach (2010), Springer-Verlag: Springer-Verlag Berlin), 101-109
[74] Tallis, G. M., Elliptical and radial truncation in normal samples, Annals of Mathematical Statistics, 34, 940-944 (1963) · Zbl 0142.16104
[76] Weisberg, S., Applied linear regression (2005), Wiley: Wiley New York · Zbl 1068.62077
[77] Wilks, S. S., Multivariate statistical outliers, Sankhya A, 25, 407-426 (1963) · Zbl 0128.13401
[78] Wisnowski, J. W.; Montgomery, D. C.; Simpson, J. R., A comparative analysis of multiple outlier detection procedures in the linear regression model, Computational Statistics and Data Analysis, 36, 351-382 (2001) · Zbl 1038.62062
[79] Zani, S.; Riani, M.; Corbellini, A., Robust bivariate boxplots and multiple outlier detection, Computational Statistics and Data Analysis, 28, 257-270 (1998) · Zbl 1042.62545
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.