
zbMATH — the first resource for mathematics

An alternative pruning based approach to unbiased recursive partitioning. (English) Zbl 06917862
Summary: Tree-based methods form a non-parametric modelling strategy that can be used in combination with generalized linear models or Cox proportional hazards models, mostly at an exploratory stage. Their popularity is mainly due to the simplicity of the technique along with the ease with which the resulting model can be interpreted. Variable selection bias towards variables with many possible splits or missing values has been identified as one of the problems associated with tree-based methods. A number of unbiased recursive partitioning algorithms have been proposed that avoid this bias by using \(p\)-values in the splitting procedure of the algorithm. The final tree is obtained using direct stopping rules (pre-pruning strategy) or by growing a large tree first and pruning it afterwards (post-pruning). Some of the drawbacks of pre-pruned trees based on \(p\)-values in the presence of interaction effects and a large number of explanatory variables are discussed, and a simple alternative post-pruning solution is presented that allows the identification of such interactions. The proposed method includes a novel pruning algorithm that uses a false discovery rate (FDR) controlling procedure to determine which splits correspond to significant tests. The new approach is demonstrated with simulated and real-life examples.
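The FDR-controlling procedure referred to in the summary is the Benjamini-Hochberg step-up procedure [1]. A minimal sketch of that procedure, applied to a hypothetical set of split \(p\)-values from a fully grown tree (the values and the function name are illustrative assumptions, not taken from the paper):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean list, True where the corresponding null hypothesis
    is rejected while controlling the FDR at level q.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k (1-based) with p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    # Reject the hypotheses belonging to the k_max smallest p-values.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

# Hypothetical split p-values for an over-grown tree; splits whose tests
# are not rejected would be candidates for pruning.
split_p = [0.001, 0.008, 0.039, 0.041, 0.27, 0.60]
print(benjamini_hochberg(split_p, q=0.05))
```

In a post-pruning setting along the lines the summary describes, branches whose split tests are not rejected by such a procedure would be collapsed after the large tree has been grown, rather than stopping growth early via a per-split \(p\)-value threshold.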
MSC:
62 Statistics
Software:
evtree; partykit; R; rpart
References:
[1] Benjamini, Y.; Hochberg, Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing, J. R. Stat. Soc. Ser. B Stat. Methodol., 57, 1, 289-300, (1995) · Zbl 0809.62014
[2] Breiman, L.; Friedman, J. H.; Stone, C. J.; Olshen, R. A., Classification and regression trees, (1984), Chapman & Hall/CRC Boca Raton, Florida · Zbl 0541.62042
[3] Chambers, J. M.; Hastie, T., Statistical models in S, (Advanced Books & Software, (1992), Wadsworth & Brooks/Cole) · Zbl 0776.62007
[4] Davis, R. B.; Anderson, J. R., Exponential survival trees, Stat. Med., 8, 8, 947-961, (1989)
[5] Gordon, L.; Olshen, R. A., Tree-structured survival analysis, Cancer Treat. Rep., 69, 10, 1065-1069, (1985)
[6] Grubinger, T.; Zeileis, A.; Pfeiffer, K.-P., Evtree: evolutionary learning of globally optimal classification and regression trees in R, J. Statistical Software, 61, 1, 1-29, (2014)
[7] Hothorn, T.; Hornik, K.; Zeileis, A., Unbiased recursive partitioning: A conditional inference framework, J. Comput. Graph. Statist., 15, 3, 651-674, (2006)
[8] Hothorn, T.; Zeileis, A., Partykit: A modular toolkit for recursive partytioning in R, J. Mach. Learn. Res., 16, 3905-3909, (2015), URL http://jmlr.org/papers/v16/hothorn15a.html · Zbl 1351.62005
[9] Ingoldsby, H.; Webber, M.; Wall, D.; Scarrott, C.; Newell, J.; Callagy, G., Prediction of oncotype DX and tailorx risk categories using histopathological and immunohistochemical markers by classification and regression tree (CART) analysis, Breast, 22, 5, 879-886, (2013)
[10] Kim, H.; Loh, W.-Y., Classification trees with unbiased multiway splits, J. Amer. Statist. Assoc., 96, 454, 598-604, (2001)
[11] Kim, H.; Loh, W.-Y., Classification trees with bivariate linear discriminant node models, J. Comput. Graph. Statist., 12, 3, 512-530, (2003)
[12] LeBlanc, M.; Crowley, J., Survival trees by goodness of split, J. Amer. Statist. Assoc., 88, 422, 457-467, (1993) · Zbl 0773.62071
[13] Loh, W.-Y., Regression trees with unbiased variable selection and interaction detection, Statist. Sinica, 12, 2, 361-386, (2002) · Zbl 0998.62042
[14] Loh, W.-Y.; Shih, Y.-S., Split selection methods for classification trees, Statist. Sinica, 7, 4, 815-840, (1997) · Zbl 1067.62545
[15] Loh, W.-Y.; Vanichsetakul, N., Tree-structured classification via generalized discriminant analysis, J. Amer. Statist. Assoc., 83, 403, 715-725, (1988) · Zbl 0649.62055
[16] Morgan, J.; Sonquist, J., Problems in the analysis of survey data, and a proposal., J. Amer. Statist. Assoc., 58, 415-434, (1963) · Zbl 0114.10103
[17] R Core Team, 2016. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, version 3.2.4 Revised. URL https://www.R-project.org/.
[18] Segal, M. R., Regression trees for censored data, Biometrics, 44, 1, 35-47, (1988) · Zbl 0707.62224
[19] Shih, Y.-S.; Tsai, H.-W., Variable selection bias in regression trees with constant fits, Comput. Statist. Data Anal., 45, 3, 595-607, (2004) · Zbl 1429.62725
[20] Strasser, H.; Weber, C., On the asymptotic theory of permutation statistics, Math. Methods Statist., 8, 2, 220-250, (1999) · Zbl 1103.62346
[21] Strobl, C.; Malley, J.; Tutz, G., An introduction to recursive partitioning: rationale, application and characteristics of classification and regression trees, bagging and random forests, Psychol. Methods, 14, 4, 323-348, (2009)
[22] Therneau, T. M.; Atkinson, E. J., An introduction to recursive partitioning using the rpart routines. Technical Report 61, Section of Biostatistics, (1997), Mayo Clinic Rochester
[23] Therneau, T., Atkinson, B., Ripley, B., 2015. rpart: Recursive Partitioning and Regression Trees. R package version 4.1-10. URL https://CRAN.R-project.org/package=rpart.
[24] White, A. P.; Liu, W. Z., Technical note: bias in information-based measures in decision tree induction, Mach. Learn., 15, 321-329, (1994) · Zbl 0942.68718
[25] Zeileis, A.; Hothorn, T.; Hornik, K., Model-based recursive partitioning, J. Comput. Graph. Statist., 17, 2, 492-514, (2008)