Conditional predictive inference post model selection. (English) Zbl 1173.62026

Summary: We give a finite-sample analysis of predictive inference procedures after model selection in a regression with random design. The analysis is focused on a statistically challenging scenario where the number of potentially important explanatory variables can be infinite, where no regularity conditions are imposed on the unknown parameters, where the number of explanatory variables in a “good” model can be of the same order as the sample size and where the number of candidate models can be of larger order than the sample size. The performance of inference procedures is evaluated conditional on the training sample.
Under weak conditions on only the number of candidate models and on their complexity, and uniformly over all data-generating processes under consideration, we show that a certain prediction interval is approximately valid and short with high probability in finite samples, in the sense that its actual coverage probability is close to the nominal one and in the sense that its length is close to the length of an infeasible interval that is constructed by actually knowing the “best” candidate model. Similar results are shown to hold for predictive inference procedures other than prediction intervals, for example, tests of whether a future response will lie above or below a given threshold.
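
The idea of a prediction interval whose coverage is assessed conditional on the training sample can be illustrated with a small simulation. The sketch below is not the procedure analysed in the paper: it is a generic sample-splitting scheme, and the function name `select_and_predict`, the nested candidate models, the Gaussian design, the coefficient sequence and all constants are assumptions made purely for illustration. Half of the training sample is used to fit each candidate model by least squares, the other half to estimate each model's prediction error, and the interval is centred at the selected model's prediction with a half-width given by a normal quantile times the estimated root mean squared prediction error.

```python
import numpy as np
from statistics import NormalDist


def select_and_predict(X, y, candidate_sizes, alpha=0.05):
    """Pick a nested candidate model by hold-out error and return an interval rule.

    Illustrative assumption: candidate model p uses the first p columns of X.
    """
    half = X.shape[0] // 2
    X_fit, y_fit = X[:half], y[:half]   # used to estimate coefficients
    X_val, y_val = X[half:], y[half:]   # used to estimate prediction error

    best = None
    for p in candidate_sizes:
        beta, *_ = np.linalg.lstsq(X_fit[:, :p], y_fit, rcond=None)
        mse = float(np.mean((y_val - X_val[:, :p] @ beta) ** 2))
        if best is None or mse < best[0]:
            best = (mse, p, beta)

    mse, p, beta = best
    z = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half_width = z * np.sqrt(mse)

    def interval(X_new):
        # Works for a single feature vector or a matrix of feature vectors.
        center = X_new[..., :p] @ beta
        return center - half_width, center + half_width

    return p, interval


# Simulated data-generating process (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n, d = 400, 50
theta = 1.0 / np.arange(1, d + 1)        # slowly decaying true coefficients
X_train = rng.standard_normal((n, d))
y_train = X_train @ theta + rng.standard_normal(n)

candidate_sizes = [1, 2, 5, 10, 20, 40]
p_selected, interval = select_and_predict(X_train, y_train, candidate_sizes)

# Coverage conditional on the training sample: hold the fitted rule fixed
# and average over many fresh (x, y) draws from the same process.
X_test = rng.standard_normal((20000, d))
y_test = X_test @ theta + rng.standard_normal(20000)
lower, upper = interval(X_test)
coverage = float(np.mean((y_test >= lower) & (y_test <= upper)))

print("selected model size:", p_selected)
print("conditional coverage estimate:", round(coverage, 3))
```

Because the fitted coefficients and the interval width are held fixed while only the test draws vary, the printed coverage estimates the conditional coverage probability for this one training sample, which is the kind of quantity the summary compares with the nominal level.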


62G08 Nonparametric regression and quantile regression
62G15 Nonparametric tolerance and confidence regions
62H12 Estimation in multivariate analysis
62J05 Linear regression; mixed models
62J07 Ridge regression; shrinkage estimators (Lasso)
60E15 Inequalities; stochastic orderings
65C60 Computational problems in statistics (MSC2010)

