A study of pre-validation. (English) Zbl 1273.62126

Summary: Given a predictor of outcome derived from a high-dimensional dataset, pre-validation is a useful technique for comparing it to competing predictors on the same dataset. For microarray data, it allows one to compare a newly derived predictor for disease outcome to standard clinical predictors on the same dataset. We study pre-validation analytically to determine if the inferences drawn from it are valid. We show that while pre-validation generally works well, the straightforward “one degree of freedom” analytical test from pre-validation can be biased and we propose a permutation test to remedy this problem. In simulation studies, we show that the permutation test has the nominal level and achieves roughly the same power as the analytical test.
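The procedure named in the summary can be illustrated with a minimal sketch, under stated assumptions. The summary does not specify the predictor, the test statistic, or the permutation scheme, so everything concrete below is hypothetical: `fit_score` is a toy predictor (the single feature most correlated with the outcome), the statistic is the absolute correlation between the pre-validated predictor and the outcome with no clinical covariates, and the permutation step shuffles the rows of the high-dimensional data while leaving the outcome fixed. It is not the authors' implementation.

```python
import random
from statistics import mean

def correlation(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5 if saa > 0 and sbb > 0 else 0.0

def fit_score(X, y, top=1):
    """Hypothetical toy predictor: signed sum of the `top` features most correlated with y."""
    p = len(X[0])
    cors = [correlation([row[j] for row in X], y) for j in range(p)]
    chosen = sorted(range(p), key=lambda j: -abs(cors[j]))[:top]
    signs = {j: (1.0 if cors[j] >= 0 else -1.0) for j in chosen}
    return lambda row: sum(s * row[j] for j, s in signs.items())

def prevalidate(X, y, folds=5):
    """Score each sample with a predictor fitted without that sample's fold."""
    n = len(y)
    z = [0.0] * n
    for k in range(folds):
        train = [i for i in range(n) if i % folds != k]
        held_out = [i for i in range(n) if i % folds == k]
        predict = fit_score([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            z[i] = predict(X[i])
    return z

def permutation_pvalue(X, y, n_perm=50, seed=0):
    """Assumed scheme: shuffle rows of X, redo pre-validation, compare statistics."""
    rng = random.Random(seed)
    observed = abs(correlation(prevalidate(X, y), y))
    hits = 0
    for _ in range(n_perm):
        X_perm = X[:]          # shallow copy; rows are reordered, not modified
        rng.shuffle(X_perm)
        if abs(correlation(prevalidate(X_perm, y), y)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Simulated data (an assumption for illustration): feature 0 carries the signal.
rng = random.Random(1)
n, p = 40, 20
y = [rng.gauss(0, 1) for _ in range(n)]
X = [[y[i] + 0.1 * rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(p - 1)]
     for i in range(n)]

z = prevalidate(X, y)
p_value = permutation_pvalue(X, y)
print(p_value)
```

The point of the sketch is the structure: the whole pre-validation loop is repeated inside each permutation, so any bias introduced by pre-validation affects the observed and the permuted statistics alike.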


62H15 Hypothesis testing in multivariate analysis



