Random survival forests for high-dimensional data. (English) Zbl 07260271

Summary: Minimal depth is a dimensionless order statistic that measures the predictiveness of a variable in a survival tree. It can be used to select variables in high-dimensional problems using Random Survival Forests (RSF), a new extension of Breiman’s Random Forests (RF) to survival settings. We review this methodology and demonstrate its use in high-dimensional survival problems using the public-domain R package randomSurvivalForest. We discuss effective ways to regularize forests and how to properly tune the RF parameters ‘nodesize’ and ‘mtry’. We also introduce new graphical ways of using minimal depth to explore variable relationships.
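
As a rough illustration of the minimal depth idea in the summary, the Python sketch below computes, for a toy binary tree, the depth of the shallowest node that splits on each variable (root split = 0), and then averages that value over a small forest. It is not the code of the R package. The Node class, the function names, and the toy trees are hypothetical, and assigning a tree's full depth to a variable that never splits in that tree is an assumption made only for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Binary tree node (hypothetical structure); var is None for a terminal node."""
    var: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def tree_depth(node: Node, depth: int = 0) -> int:
    """Depth of the deepest terminal node, counting the root as depth 0."""
    if node.var is None:
        return depth
    return max(tree_depth(node.left, depth + 1), tree_depth(node.right, depth + 1))


def minimal_depth(root: Node) -> Dict[str, int]:
    """Depth of the shallowest split on each variable (root split = 0)."""
    depths: Dict[str, int] = {}
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if node.var is None:
            continue  # terminal node: no split here
        depths[node.var] = min(depth, depths.get(node.var, depth))
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return depths


def forest_minimal_depth(trees: List[Node], variables: List[str]) -> Dict[str, float]:
    """Average minimal depth per variable over all trees.

    Assumption: a variable that never splits a tree is assigned that tree's depth.
    """
    totals = {v: 0.0 for v in variables}
    for tree in trees:
        found = minimal_depth(tree)
        fallback = tree_depth(tree)
        for v in variables:
            totals[v] += found.get(v, fallback)
    return {v: total / len(trees) for v, total in totals.items()}


# Toy trees with made-up variable names, for illustration only.
tree1 = Node("age", Node("gene_a", Node(), Node()), Node())
tree2 = Node("age", Node("gene_b", Node(), Node()), Node("gene_a", Node(), Node()))

print(minimal_depth(tree1))
print(forest_minimal_depth([tree1, tree2], ["age", "gene_a", "gene_b"]))
```

In this sketch a smaller average value means the variable tends to split closer to the root, which is the sense in which minimal depth measures predictiveness.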

MSC:

62-XX Statistics
68-XX Computer science

References:

[1] D. Nguyen and D. M. Rocke, Partial least squares proportional hazard regression for application to DNA microarray data, Bioinformatics 18 (2002), 1625-1632.
[2] H. Z. Li and J. Gui, Partial Cox regression analysis for high-dimensional microarray gene expression data, Bioinformatics 20 (2004), 208-215.
[3] E. Bair and R. Tibshirani, Semi-supervised methods to predict patient survival from gene expression data, PLoS Biol 2 (2004), 0511-0522.
[4] M.-Y. Park and T. Hastie, L1-regularization path algorithm for generalized linear models, JRSSB 69 (2007), 659-677. · Zbl 07555370
[5] H. H. Zhang and W. Lu, Adaptive Lasso for Cox’s proportional hazards model, Biometrika 94 (2007), 691-703. · Zbl 1135.62083
[6] H. Li and Y. Luan, Boosting proportional hazards models using smoothing splines, with applications to high-dimensional microarray data, Bioinformatics 21 (2006), 2403-2409.
[7] S. Ma and J. Huang, Clustering threshold gradient descent regularization: with applications to microarray studies, Bioinformatics 23 (2006), 466-472.
[8] T. Hothorn and P. Bühlmann, Model-based boosting in high dimensions, Bioinformatics 22 (2006), 2828-2829.
[9] H. Binder and M. Schumacher, Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models, BMC Bioinform 9 (2008), 14.
[10] G. Ridgeway, The state of boosting, Comput Sci Stat 31 (1999), 172-181.
[11] L. Breiman, Random forests, Mach Learn 45 (2001), 5-32. · Zbl 1007.68152
[12] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, California, Belmont, 1984. · Zbl 0541.62042
[13] L. Breiman, Bagging predictors, Mach Learn 26 (1996), 123-140. · Zbl 0858.68080
[14] L. Breiman, Heuristics of instability and stabilization in model selection, Ann Stat 24 (1996), 2350-2383. · Zbl 0867.62055
[15] A. Liaw and M. Wiener. randomForest 4.5-36. R package, 2010, http://cran.r-project.org.
[16] K. L. Lunetta, L. B. Hayward, J. Segal, and P. V. Eerdewegh, Screening large-scale association study data: exploiting interactions using random forests, BMC Genet 5 (2004), 32.
[17] A. Bureau, J. Dupuis, K. Falls, K. L. Lunetta, B. Hayward, T. P. Keith, and P. V. Eerdewegh, Identifying SNPs predictive of phenotype using random forests, Genet Epidemiol 28 (2005), 171-182.
[18] R. Diaz-Uriarte and S. Alvarez de Andres, Gene selection and classification of microarray data using random forest, BMC Bioinform 7 (2006), 3.
[19] H. Ishwaran, U. B. Kogalur, E. H. Blackstone, and M. S. Lauer, Random survival forests, Ann Appl Stat 2 (2008), 841-860. · Zbl 1149.62331
[20] H. Ishwaran and U. B. Kogalur, Random survival forests for R, Rnews 7/2 (2007), 25-31.
[21] H. Ishwaran and U. B. Kogalur, RandomSurvivalForest: Random Survival Forests. R package version 3.6.3, 2010, http://cran.r-project.org. · Zbl 1190.62177
[22] H. Ishwaran, U. B. Kogalur, E. Z. Gorodeski, A. J. Minn, and M. S. Lauer, High-dimensional variable selection for survival data, J Am Stat Assoc 105 (2010), 205-217. · Zbl 1397.62220
[23] H. Ishwaran, Variable importance in binary regression trees and forests, Electron J Stat 1 (2007), 519-537. · Zbl 1320.62158
[24] H. Ishwaran, U. B. Kogalur, R. D. Moore, S. J. Gange, and B. M. Lau, Random survival forests for competing risks (submitted), 2010.
[25] R. Genuer, J.-M. Poggi, and C. Tuleau, Random Forests: some methodological insights, ArXiv e-prints, 0811.3619, 2008.
[26] M. R. Segal, Regression trees for censored data, Biometrics 44 (1988), 35-47. · Zbl 0707.62224
[27] P. Geurts, D. Ernst, and L. Wehenkel, Extremely randomized trees, Mach Learn 63 (2006), 3-42. · Zbl 1110.68124
[28] D. Amaratunga, J. Cabrera, and Y.-S. Lee, Enriched random forests, Bioinformatics 24(18) (2008), 2010-2014.
[29] H. Binder. Cox models by likelihood based boosting for a single survival endpoint or competing risks. R package version 1.2-1, 2010, http://cran.r-project.org.
[30] L. J. van’t Veer, H. Dai, M. J. van de Vijver, D. Yudong, A. A. M. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend, Gene expression profiling predicts clinical outcome of breast cancer, Nature 415 (2002), 530-536.
[31] M. J. van de Vijver, Y. D. He, L. J. van’t Veer, H. Dai, A. A. M. Hart, D. W. Voskuil, G. J. Schreiber, J. L. Peterse, C. Roberts, M. J. Marton, M. Parrish, D. Atsma, A. Witteveen, A. Glas, L. Delahaye, T. van der Velde, H. Bartelink, S. Rodenhuis, E. T. Rutgers, S. H. Friend, and R. Bernards, A gene-expression signature as a predictor of survival in breast cancer, N Engl J Med 347 (2002), 1999-2009.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.