Tree ensembles with rule structured horseshoe regularization.

*(English)* Zbl 1412.62169

Summary: We propose a new Bayesian model for flexible nonlinear regression and classification using tree ensembles. The model is based on the RuleFit approach in [J. H. Friedman and B. E. Popescu, Ann. Appl. Stat. 2, No. 3, 916–954 (2008; Zbl 1149.62051)], where rules from decision trees and linear terms are used in an L1-regularized regression. We modify RuleFit by replacing the L1-regularization with a horseshoe prior, which is well known to give aggressive shrinkage of noise predictors while leaving the important signal essentially untouched. This is especially important when a large number of rules are used as predictors, as many of them only contribute noise. Our horseshoe prior has an additional hierarchical layer that applies more shrinkage a priori to rules with a large number of splits, and to rules that are only satisfied by a few observations. The aggressive noise shrinkage of our prior also makes it possible to complement the rules from boosting in RuleFit with an additional set of trees from Random Forest, which brings a desirable diversity to the ensemble. We sample from the posterior distribution using a very efficient and easily implemented Gibbs sampler. The new model is shown to outperform state-of-the-art methods like RuleFit, BART and Random Forest on 16 datasets. The model and its interpretation are demonstrated on the well-known Boston housing data, and on gene expression data for cancer classification. The posterior sampling, prediction and graphical tools for interpreting the model results are implemented in a publicly available \(\mathtt R\) package.
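To make the sampling idea concrete, here is a minimal Python sketch of a Gibbs sampler for linear regression under a horseshoe prior, using the inverse-gamma auxiliary-variable representation of the half-Cauchy distribution (the approach of reference [18]). It is not the authors' R implementation. The function names `inv_gamma`, `rule_prior_scale` and `horseshoe_gibbs`, the parameters `mu` and `eta`, and the specific form of the per-rule prior scale are assumptions chosen for illustration; the summary only states that rules with many splits and rules with small support receive more shrinkage a priori.

```python
import numpy as np


def inv_gamma(rng, shape, rate):
    """Draw inverse-gamma(shape, rate) variates as reciprocals of gamma draws."""
    return 1.0 / rng.gamma(shape, 1.0 / rate)


def rule_prior_scale(support, length, mu=1.0, eta=2.0):
    """Hypothetical per-rule prior scale (an assumed form, for illustration only).

    It is smaller, meaning more prior shrinkage, for rules with many splits
    (large `length`) and for rules whose support is close to 0 or 1.
    """
    support = np.asarray(support, dtype=float)
    length = np.asarray(length, dtype=float)
    return (2.0 * np.minimum(support, 1.0 - support)) ** mu / length ** eta


def horseshoe_gibbs(X, y, prior_scale=None, n_iter=1500, burn_in=500, seed=0):
    """Gibbs sampler for y = X beta + noise with a horseshoe prior on beta.

    Assumed model: beta_j ~ N(0, sigma2 * tau2 * lam2_j), lam_j ~ C+(0, A_j),
    tau ~ C+(0, 1), p(sigma2) proportional to 1 / sigma2.
    Returns the post-burn-in draws of beta, one row per iteration.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = np.ones(p) if prior_scale is None else np.asarray(prior_scale, dtype=float)
    lam2, nu = np.ones(p), np.ones(p)      # local scales and their auxiliaries
    tau2, xi, sigma2 = 1.0, 1.0, 1.0       # global scale, its auxiliary, noise variance
    XtX, Xty = X.T @ X, X.T @ y
    draws = np.empty((n_iter - burn_in, p))
    for it in range(n_iter):
        # beta | rest ~ N(Q^{-1} X'y, sigma2 Q^{-1}), Q = X'X + diag(1 / (tau2 lam2))
        Q = XtX + np.diag(1.0 / (tau2 * lam2))
        L = np.linalg.cholesky(Q)
        mean = np.linalg.solve(Q, Xty)
        beta = mean + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(p))
        # sigma2 | rest is inverse-gamma
        resid = y - X @ beta
        rate = 0.5 * (resid @ resid) + 0.5 * np.sum(beta ** 2 / (tau2 * lam2))
        sigma2 = inv_gamma(rng, 0.5 * (n + p), rate)
        # local scales, global scale, then the auxiliary variables
        lam2 = inv_gamma(rng, 1.0, 1.0 / nu + beta ** 2 / (2.0 * tau2 * sigma2))
        tau2 = inv_gamma(rng, 0.5 * (p + 1),
                         1.0 / xi + np.sum(beta ** 2 / lam2) / (2.0 * sigma2))
        nu = inv_gamma(rng, 1.0, 1.0 / A ** 2 + 1.0 / lam2)
        xi = inv_gamma(rng, 1.0, 1.0 + 1.0 / tau2)
        if it >= burn_in:
            draws[it - burn_in] = beta
    return draws
```

In a RuleFit-style setting, the columns of `X` would hold binary rule indicators and standardized linear terms, and `prior_scale` could be set from each rule's support and number of splits via `rule_prior_scale`. With `prior_scale=None` the sketch reduces to the ordinary horseshoe, where every local scale has a standard half-Cauchy prior.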

##### MSC:

62P10 | Applications of statistics to biology and medical sciences; meta analysis |

62J02 | General nonlinear regression |

62H30 | Classification and discrimination; cluster analysis (statistical aspects) |

62M20 | Inference from stochastic processes and prediction |

##### Keywords:

nonlinear regression; classification; decision trees; Bayesian model; prediction; MCMC; interpretation

##### Citations:

Zbl 1149.62051

*M. Nalenz* and *M. Villani*, Ann. Appl. Stat. 12, No. 4, 2379–2408 (2018; Zbl 1412.62169)

##### References:

[1] | Breiman, L. (1996). Stacked regressions. Mach. Learn. 24 49–64. · Zbl 0849.68104 |

[2] | Breiman, L. (2001). Random forests. Mach. Learn. 45 5–32. · Zbl 1007.68152 |

[3] | Carvalho, C. M., Polson, N. G. and Scott, J. G. (2009). Handling sparsity via the horseshoe. In AISTATS 5 73–80. |

[4] | Carvalho, C. M., Polson, N. G. and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika 97 465–480. · Zbl 1406.62021 |

[5] | Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794. ACM, New York. |

[6] | Chipman, H. A., George, E. I. and McCulloch, R. E. (2010). BART: Bayesian additive regression trees. Ann. Appl. Stat. 4 266–298. · Zbl 1189.62066 |

[7] | Cohen, W. W. (1995). Fast effective rule induction. In Proceedings of the 12th International Conference on Machine Learning (ICML’95) 115–123. |

[8] | Dembczyński, K., Kotłowski, W. and Słowiński, R. (2010). ENDER: A statistical framework for boosting decision rules. Data Min. Knowl. Discov. 21 52–90. |

[9] | Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 96 148–156. Bari, Italy. |

[10] | Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Ann. Statist. 29 1189–1232. · Zbl 1043.62034 |

[11] | Friedman, J. H. and Popescu, B. E. (2008). Predictive learning via rule ensembles. Ann. Appl. Stat. 2 916–954. · Zbl 1149.62051 |

[12] | Fürnkranz, J. (1999). Separate-and-conquer rule learning. Artif. Intell. Rev. 13 3–54. · Zbl 0922.68030 |

[13] | George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88 881–889. |

[14] | Hahn, P. R. and Carvalho, C. M. (2015). Decoupling shrinkage and selection in Bayesian linear models: A posterior summary perspective. J. Amer. Statist. Assoc. 110 435–448. · Zbl 1373.62036 |

[15] | Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 3149–3157. |

[16] | Li, L. and Yao, W. (2014). Fully Bayesian logistic regression with hyper-LASSO priors for high-dimensional feature selection. J. Stat. Comput. Simul. 88 2827–2851. |

[17] | Linero, A. R. (2018). Bayesian regression trees for high dimensional prediction and variable selection. J. Amer. Statist. Assoc. 113 626–636. · Zbl 1398.62065 |

[18] | Makalic, E. and Schmidt, D. F. (2016). A simple sampler for the horseshoe estimator. IEEE Signal Process. Lett. 23 179–182. |

[19] | Nalenz, M. and Villani, M. (2018). Supplement to “Tree ensembles with rule structured horseshoe regularization.” DOI:10.1214/18-AOAS1157SUPP. |

[20] | Piironen, J. and Vehtari, A. (2017). Comparison of Bayesian predictive methods for model selection. Stat. Comput. 27 711–735. · Zbl 06737693 |

[21] | Polson, N. G., Scott, J. G. and Windle, J. (2013). Bayesian inference for logistic models using Pólya–Gamma latent variables. J. Amer. Statist. Assoc. 108 1339–1349. · Zbl 1283.62055 |

[22] | Puelz, D., Hahn, P. R. and Carvalho, C. M. (2017). Variable selection in seemingly unrelated regressions with random predictors. Bayesian Anal. 12 969–989. · Zbl 1384.62262 |

[23] | Rokach, L. (2010). Ensemble-based classifiers. Artif. Intell. Rev. 33 1–39. |

[24] | Schapire, R. E. (1999). A brief introduction to boosting. In IJCAI 1401–1406. |

[25] | Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P. et al. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1 203–209. |

[26] | Slonim, D. K. (2002). From patterns to pathways: Gene expression data analysis comes of age. Nat. Genet. 32 (Supp) 502. |

[27] | Smith, M. and Kohn, R. (1996). Nonparametric regression using Bayesian variable selection. J. Econometrics 75 317–343. · Zbl 0864.62025 |

[28] | Terenin, A., Dong, S. and Draper, D. (2016). GPU-accelerated Gibbs sampling. Preprint. Available at arXiv:1608.04329. |

[29] | Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. · Zbl 0850.62538 |

[30] | Van’t Veer, L., Dai, H., Van De Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., Van Der Kooy, K., Marton, M. J., Witteveen, A. T. et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415 530–536. |

[31] | Wolpert, D. H. (1992). Stacked generalization. Neural Netw. 5 241–259. |

[32] | Yap, Y., Zhang, X., Ling, M. T., Wang, X., Wong, Y. C. and Danchin, A. (2004). Classification between normal and tumor tissues based on the pair-wise gene expression ratio. BMC Cancer 4 72. |

[33] | Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 301–320. · Zbl 1069.62054 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.