
Does data splitting improve prediction? (English) Zbl 1342.62025

Summary: Data splitting divides the data into two parts. One part is reserved for model selection. In some applications the second part is used for model validation, but we use this part for estimating the parameters of the chosen model. We focus on the problem of constructing reliable predictive distributions for future observed values and judge predictive performance using log scoring. We compare the full-data strategy with the data-splitting strategy for prediction and show how the full-data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high. We investigate the relative performance of the two strategies in four simulation scenarios and introduce a hybrid estimator that uses one part for model selection but both parts for estimation. We argue that, with some exceptions, a split-data analysis is preferred to a full-data analysis for prediction.
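
The following R sketch (not taken from the paper) illustrates the three strategies described in the summary on simulated linear-regression data. Stepwise AIC selection via MASS::stepAIC stands in for whatever selection procedure is actually applied; the data-generating model, sample sizes and the 50/50 split are illustrative assumptions, and the log score uses a plug-in Gaussian predictive density.

## Contrast full-data, split-data and hybrid strategies by log score.
library(MASS)

set.seed(1)
n <- 200                                # training sample size (assumed)
p <- 10                                 # candidate predictors (assumed)
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- 1 + 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)   # only x1 and x2 matter
train <- data.frame(y, X)

## Future observations on which predictive performance is judged
m  <- 1000
Xf <- matrix(rnorm(m * p), m, p)
colnames(Xf) <- paste0("x", 1:p)
yf <- 1 + 2 * Xf[, 1] - 1.5 * Xf[, 2] + rnorm(m)
future <- data.frame(Xf)

## Mean log predictive density under a plug-in Gaussian predictive
log_score <- function(fit, newdata, ynew) {
  mu    <- predict(fit, newdata)
  sigma <- summary(fit)$sigma
  mean(dnorm(ynew, mean = mu, sd = sigma, log = TRUE))
}

## Full-data strategy: select and estimate on all of the data
full_fit <- stepAIC(lm(y ~ ., data = train), trace = FALSE)

## Split-data strategy: select on one half, re-estimate on the other half
idx       <- sample(n, n / 2)
sel_fit   <- stepAIC(lm(y ~ ., data = train[idx, ]), trace = FALSE)
split_fit <- lm(formula(sel_fit), data = train[-idx, ])

## Hybrid strategy: select on one half, estimate on all of the data
hybrid_fit <- lm(formula(sel_fit), data = train)

c(full   = log_score(full_fit,   future, yf),
  split  = log_score(split_fit,  future, yf),
  hybrid = log_score(hybrid_fit, future, yf))

With many irrelevant candidate predictors, the split and hybrid fits typically avoid part of the data reuse cost incurred by selecting and estimating on the same observations, which is the comparison the paper formalizes through its decomposition of the full-data score.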

MSC:

62F10 Point estimation
62-07 Data analysis (statistics) (MSC2010)
62C05 General considerations in statistical decision theory

Software:

MASS (R); R
