An explicit split point procedure in model-based trees allowing for a quick fitting of GLM trees and GLM forests. (English) Zbl 1477.62004

Summary: Classification and regression trees (CART) are a genuine alternative to fully parametric models such as linear models (LM) and generalized linear models (GLM). Although CART suffer from biased variable selection, they are commonly applied to a wide range of problems and used in tree ensembles and random forests because of their simplicity and computational speed. Conditional inference trees and model-based tree algorithms, in which variable selection is handled via fluctuation tests, are known to give more accurate and interpretable results than CART, but at the cost of longer computation times. Using a closed-form maximum likelihood estimator for GLM, this paper proposes a split point procedure based on the explicit likelihood, which saves time when searching for the best split for a given splitting variable. A simulation study with non-Gaussian responses is performed to assess the computational gain when building GLM trees. We also benchmark GLM trees against CART, conditional inference trees and LM trees on simulated and empirical datasets in order to identify situations where GLM trees are efficient. The approach is extended to multiway split trees and log-transformed distributions. Making GLM trees feasible through the new split point procedure allows us to investigate the use of GLM in ensemble methods. We provide a numerical comparison of GLM forests with other random forest-type approaches. Our simulation analyses show cases where GLM forests are strong challengers to random forests.
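To illustrate the idea of an explicit-likelihood split search, here is a minimal sketch under simplifying assumptions; it is not the procedure from the paper. It assumes a Poisson response with an intercept-only model in each child node, so that the maximum likelihood estimate of the mean in a child is simply its sample mean and the log-likelihood of every candidate split can be read off cumulative sums, with no iterative fitting. The function names `poisson_loglik_at_mean` and `best_split`, and the `min_size` parameter, are hypothetical.

```python
import numpy as np


def poisson_loglik_at_mean(sum_y, n):
    """Poisson log-likelihood evaluated at the closed-form MLE mu = sum_y / n.

    The constant term -sum(log(y!)) is dropped because it is identical for
    every candidate split. The convention 0 * log(0) = 0 is used.
    """
    mu = sum_y / n
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(sum_y > 0, sum_y * np.log(mu), 0.0)
    # sum_y * log(mu) - n * mu, and n * mu equals sum_y at the MLE
    return term - sum_y


def best_split(x, y, min_size=1):
    """Return (threshold, log-likelihood) of the best binary split on x.

    Assumption for illustration: intercept-only Poisson model per child node.
    Returns None when no admissible split exists.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]
    n = len(ys)
    if n < 2:
        return None

    csum = np.cumsum(ys)
    total = csum[-1]
    k = np.arange(1, n)  # left child holds the first k sorted observations

    # A split is admissible only between two distinct x values and when
    # both children contain at least min_size observations.
    valid = (xs[k - 1] < xs[k]) & (k >= min_size) & (n - k >= min_size)
    if not valid.any():
        return None

    left = poisson_loglik_at_mean(csum[k - 1], k)
    right = poisson_loglik_at_mean(total - csum[k - 1], n - k)
    loglik = np.where(valid, left + right, -np.inf)

    j = int(np.argmax(loglik))
    threshold = 0.5 * (xs[j] + xs[j + 1])  # midpoint between neighbours
    return threshold, float(loglik[j])
```

After the initial sort, all candidate split points for one splitting variable are scored in a single vectorised pass. A search that refits a GLM numerically at each candidate would instead run one iterative fit per candidate, which is the cost an explicit likelihood is meant to avoid.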

MSC:

62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence
