Oblique random survival forests. (English) Zbl 1433.62305

Summary: We introduce and evaluate the oblique random survival forest (ORSF). The ORSF is an ensemble method for right-censored survival data that uses linear combinations of input variables to recursively partition a set of training data. Regularized Cox proportional hazards models are used to identify linear combinations of input variables in each recursive partitioning step. Benchmark results using simulated and real data indicate that the ORSF's predicted risk function has high prognostic value in comparison to random survival forests, conditional inference forests, regression, and boosting. In an application to data from the Jackson Heart Study, we demonstrate variable and partial dependence using the ORSF and highlight characteristics of its ten-year predicted risk function for atherosclerotic cardiovascular disease events (ASCVD; stroke, coronary heart disease). We present visualizations comparing variable and partial effect estimation according to the ORSF, the conditional inference forest, and the Pooled Cohort Risk equations. The obliqueRSF R package, which provides functions to fit the ORSF and create variable and partial dependence plots, is available on the Comprehensive R Archive Network (CRAN).
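The partitioning step described in the summary can be illustrated with a minimal sketch. It is not the obliqueRSF implementation: the function names `fit_ridge_cox` and `oblique_split` are hypothetical, a ridge penalty fitted by plain gradient ascent stands in for whatever regularized Cox fit the authors use, and splitting at the median of the linear combination is an assumption made only to keep the example short. The data are simulated.

```python
import math
import random


def fit_ridge_cox(X, time, event, penalty=1.0, lr=0.1, n_iter=100):
    """Gradient ascent on a ridge-penalized Cox partial log-likelihood.

    Assumption: ridge + gradient ascent is a stand-in for the regularized
    Cox model mentioned in the summary, not the authors' fitting routine.
    """
    n, p = len(X), len(X[0])
    # Risk set of an observed event: everyone still under observation then.
    risk_sets = {
        i: [j for j in range(n) if time[j] >= time[i]]
        for i in range(n) if event[i]
    }
    beta = [0.0] * p
    for _ in range(n_iter):
        w = [math.exp(sum(b * v for b, v in zip(beta, row))) for row in X]
        grad = [-penalty * b for b in beta]  # gradient of the ridge term
        for i, risk in risk_sets.items():
            denom = sum(w[j] for j in risk)
            for k in range(p):
                weighted_mean = sum(w[j] * X[j][k] for j in risk) / denom
                grad[k] += X[i][k] - weighted_mean
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def oblique_split(X, time, event, **kwargs):
    """Split a node on a linear combination of inputs (an oblique split)."""
    beta = fit_ridge_cox(X, time, event, **kwargs)
    score = [sum(b * v for b, v in zip(beta, row)) for row in X]
    cut = sorted(score)[len(score) // 2]  # median cutpoint (assumption)
    left = [i for i, s in enumerate(score) if s < cut]
    right = [i for i, s in enumerate(score) if s >= cut]
    return beta, cut, left, right


# Simulated right-censored data: the hazard rises with x1 + x2.
random.seed(1)
X, time, event = [], [], []
for _ in range(100):
    x = [random.gauss(0, 1), random.gauss(0, 1)]
    t = random.expovariate(math.exp(x[0] + x[1]))  # latent event time
    c = random.expovariate(0.3)                    # censoring time
    X.append(x)
    time.append(min(t, c))
    event.append(t <= c)

beta, cut, left, right = oblique_split(X, time, event)
print("coefficients:", [round(b, 2) for b in beta])
print("node sizes:", len(left), len(right))
```

An axis-based tree would split on a single input; here the boundary is the hyperplane where the fitted linear combination equals `cut`, so both inputs contribute to the same split. Applying `oblique_split` recursively to the `left` and `right` index sets, on bootstrap samples with random subsets of inputs, would give one plausible way to grow an ensemble of such trees.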

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62N02 Estimation in survival analysis and censored data
62-08 Computational methods for problems pertaining to statistics
