## Valid inference corrected for outlier removal. (English) Zbl 07499259

Summary: Ordinary least squares (OLS) estimation of a linear regression model is well known to be highly sensitive to outliers. It is common practice to (1) identify and remove outliers by looking at the data and (2) fit OLS and form confidence intervals and $$p$$-values on the remaining data as if it were the original data collected. This standard “detect-and-forget” approach has been shown to be problematic, and in this article we highlight the fact that it can lead to invalid inference and show how recently developed tools in selective inference can be used to properly account for outlier detection and removal. Our inferential procedures apply to a general class of outlier removal procedures that includes several of the most commonly used approaches. We conduct simulations to corroborate the theoretical results, and we apply our method to three real datasets to illustrate how our inferential results can differ from the traditional detect-and-forget strategy. A companion R package, `outference`, implements these new procedures with an interface that matches the functions commonly used for inference with `lm` in R. Supplementary materials for this article are available online.
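The problem with detect-and-forget is easy to see by simulation. The sketch below is our own illustration, not the paper's method or its `outference` code: the 2-standard-deviation residual cutoff and the critical value are assumptions made for the example. Under a true null slope, dropping large-residual points shrinks the estimated error variance, so the naive refitted $$t$$-test rejects more often than the nominal 5%.

```python
# Illustrative simulation (our sketch, not the paper's procedure):
# "detect-and-forget" = drop points with |standardized residual| > 2,
# refit OLS, and test the slope as if the data were never screened.
import math
import random

random.seed(1)

def slope_fit(x, y):
    """Simple-regression OLS: returns (t-stat for slope, residuals, sigma-hat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]
    sigma = math.sqrt(sum(r * r for r in resid) / (n - 2))
    se = sigma / math.sqrt(sxx)
    return b / se, resid, sigma

CRIT = 2.01  # approx. two-sided 5% t critical value for df near 48 (assumed)
reps, n = 2000, 50
full_rej = naive_rej = 0
for _ in range(reps):
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [random.gauss(0, 1) for _ in range(n)]  # true slope is 0
    t, resid, sigma = slope_fit(x, y)
    full_rej += abs(t) > CRIT
    # detect-and-forget: remove "outliers", refit, test naively
    keep = [i for i, r in enumerate(resid) if abs(r) <= 2 * sigma]
    t2, _, _ = slope_fit([x[i] for i in keep], [y[i] for i in keep])
    naive_rej += abs(t2) > CRIT

# Full-data test holds its level; the naive post-removal test does not.
print("full-data rejection rate: ", full_rej / reps)
print("detect-and-forget rate:   ", naive_rej / reps)
```

The paper's selective-inference procedures repair exactly this: they condition on the event that the given outlier removal rule produced the retained data, restoring valid $$p$$-values and confidence intervals.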

### MSC:

 62-XX Statistics

### Software:

R; selectiveInference; outference

### References:

[1] Atkinson, A. C., Plots, Transformations and Regression; an Introduction to Graphical Methods of Diagnostic Regression Analysis (1985) · Zbl 0582.62065
[2] Atkinson, A. C., “Influential Observations, High Leverage Points, and Outliers in Linear Regression: Comment: Aspects of Diagnostic Regression Analysis,” Statistical Science, 1, 397-402 (1986)
[3] Belsley, D. A.; Kuh, E.; Welsch, R. E., Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, 571 (2005), New York: Wiley · Zbl 0479.62056
[4] Berenguer-Rico, V.; Nielsen, B., Marked and Weighted Empirical Processes of Residuals With Applications to Robust Regressions (2017)
[5] Berenguer-Rico, V.; Wilms, I., White Heteroscedasticity Testing After Outlier Removal (2018)
[6] Bi, N.; Markovic, J.; Xia, L.; Taylor, J., “Inferactive Data Analysis,” arXiv no. 1707.06692 (2017) · Zbl 1446.62340
[7] Brownlee, K. A., Statistical Theory and Methodology in Science and Engineering, 150 (1965), New York: Wiley · Zbl 0136.39203
[8] Cook, R. D., “Detection of Influential Observation in Linear Regression,” Technometrics, 19, 15-18 (1977) · Zbl 0371.62096
[9] Daniel, C.; Wood, F. S., Fitting Equations to Data: Computer Analysis of Multifactor Data (1999), New York: Wiley · Zbl 0998.65503
[10] Eichholtz, P.; Kok, N.; Quigley, J. M., “Doing Well by Doing Good? Green Office Buildings,” American Economic Review, 100, 2492-2509 (2010)
[11] Fithian, W.; Sun, D.; Taylor, J., “Optimal Inference After Model Selection,” arXiv no. 1410.2597 (2014)
[12] Hadi, A., A Stepwise Procedure for Identifying Multiple Outliers in Linear Regression, 137, 142 (1990)
[13] Hadi, A. S.; Simonoff, J. S., “Procedures for the Identification of Multiple Outliers in Linear Models,” Journal of the American Statistical Association, 88, 1264-1272 (1993)
[14] Harris, X. T.; Panigrahi, S.; Markovic, J.; Bi, N.; Taylor, J., “Selective Sampling After Solving a Convex Problem,” arXiv no. 1609.05609 (2016)
[15] Hoeting, J.; Raftery, A. E.; Madigan, D., “A Method for Simultaneous Variable Selection and Outlier Identification in Linear Regression,” Computational Statistics & Data Analysis, 22, 251-270 (1996) · Zbl 0900.62352
[16] Huber, P. J., “A Robust Version of the Probability Ratio Test,” The Annals of Mathematical Statistics, 36, 1753-1758 (1965) · Zbl 0137.12702
[17] Huber, P. J., “Robust Estimation of a Location Parameter,” in Breakthroughs in Statistics, 492-518 (1992), New York: Springer
[18] Huber, P.; Ronchetti, E., Robust Statistics, Wiley Series in Probability and Mathematical Statistics, 52, 54 (1981), New York: Wiley
[19] Kibert, C. J., Sustainable Construction: Green Building Design and Delivery (2016), Hoboken, NJ: Wiley
[20] Lee, J. D.; Sun, D. L.; Sun, Y.; Taylor, J. E., “Exact Post-Selection Inference, With Application to the Lasso,” The Annals of Statistics, 44, 907-927 (2016) · Zbl 1341.62061
[21] Loftus, J. R.; Taylor, J. E., “Selective Inference in Regression Models With Groups of Variables,” arXiv no. 1511.01478 (2015)
[22] Maronna, R.; Martin, R. D.; Yohai, V., Robust Statistics, 1 (2006), Chichester: Wiley · Zbl 1094.62040
[23] Panigrahi, S.; Taylor, J.; Weinstein, A., “Bayesian Post-Selection Inference in the Linear Model,” arXiv no. 1605.08824 (2016)
[24] R Core Team, R: A Language and Environment for Statistical Computing (2017), Vienna, Austria: R Foundation for Statistical Computing
[25] Reid, S.; Tibshirani, R.; Friedman, J., “A Study of Error Variance Estimation in Lasso Regression,” arXiv no. 1311.5274 (2013)
[26] Rousseeuw, P. J., “Least Median of Squares Regression,” Journal of the American Statistical Association, 79, 871-880 (1984) · Zbl 0547.62046
[27] She, Y.; Owen, A. B., “Outlier Detection Using Nonconvex Penalized Regression,” Journal of the American Statistical Association, 106, 626-639 (2011) · Zbl 1232.62068
[28] Taylor, J.; Tibshirani, R. J., “Statistical Learning and Selective Inference,” Proceedings of the National Academy of Sciences of the United States of America, 112, 7629-7634 (2015) · Zbl 1359.62228
[29] Thompson, R., “A Note on Restricted Maximum Likelihood Estimation With an Alternative Outlier Model,” Journal of the Royal Statistical Society, Series B, 47, 53-55 (1985)
[30] Tian, X.; Taylor, J., “Selective Inference With a Randomized Response,” The Annals of Statistics, 46, 679-710 (2018) · Zbl 1392.62144
[31] Tibshirani, R., “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society, Series B, 58, 267-288 (1996) · Zbl 0850.62538
[32] Tibshirani, R.; Taylor, J.; Loftus, J.; Reid, S., selectiveInference: Tools for Post-Selection Inference, R Package Version 1.2.2 (2017)
[33] Welsch, R. E.; Kuh, E., Linear Regression Diagnostics, NBER Working Paper 0173 (1977), National Bureau of Economic Research, Inc.
[34] Welsh, A. H.; Ronchetti, E., “A Journey in Single Steps: Robust One-Step M-Estimation in Linear Regression,” Journal of Statistical Planning and Inference, 103, 287-310 (2002) · Zbl 0988.62040
[35] Yekutieli, D., “Adjusted Bayesian Inference for Selected Parameters,” Journal of the Royal Statistical Society, Series B, 74, 515-541 (2012) · Zbl 1411.62074
[36] Yohai, V. J., “High Breakdown-Point and High Efficiency Robust Estimates for Regression,” The Annals of Statistics, 15, 642-656 (1987) · Zbl 0624.62037
[37] Zaman, A.; Rousseeuw, P. J.; Orhan, M., “Econometric Applications of High-Breakdown Robust Regression Techniques,” Economics Letters, 71, 1-8 (2001) · Zbl 0984.91065
[38] Zhao, J.; Leng, C.; Li, L.; Wang, H., “High-Dimensional Influence Measure,” The Annals of Statistics, 41, 2639-2667 (2013) · Zbl 1360.62411