zbMATH — the first resource for mathematics

Impact of subsampling and tree depth on random forests. (English) Zbl 1409.62072
Summary: Random forests are ensemble learning methods introduced by L. Breiman [Mach. Learn. 45, No. 1, 5–32 (2001; Zbl 1007.68152)] that operate by averaging several decision trees built on randomly selected subspaces of the data set. Despite their widespread use in practice, the respective roles of the different mechanisms at work in Breiman’s forests are not yet fully understood, nor is the tuning of the corresponding parameters. In this paper, we study the influence of two parameters, namely the subsampling rate and the tree depth, on the performance of Breiman’s forests. More precisely, we prove that quantile forests (a specific type of random forest) based on subsampling and quantile forests whose tree construction is terminated early have similar performance, as long as their respective parameters (subsampling rate and tree depth) are well chosen. Moreover, experiments show that a proper tuning of these parameters leads in most cases to an improvement over Breiman’s original forests in terms of mean squared error.
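The paper itself is theoretical, but the two parameters it studies can be explored empirically. As a minimal sketch (not the authors' code), scikit-learn's RandomForestRegressor exposes them directly: max_samples plays the role of the subsampling rate and max_depth that of early-stopped tree construction; the synthetic regression problem below is an assumption for illustration only.

```python
# Sketch: comparing a default Breiman forest against a subsampled forest
# and a depth-limited forest on synthetic regression data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 5))
# Noisy signal depending on the first coordinate only.
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Default forest: fully grown trees, full-size bootstrap samples.
default = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Subsampled forest: each tree sees only 50% of the training points.
subsampled = RandomForestRegressor(
    n_estimators=100, max_samples=0.5, random_state=0
).fit(X_tr, y_tr)

# Early-stopped forest: tree construction terminated at depth 6.
shallow = RandomForestRegressor(
    n_estimators=100, max_depth=6, random_state=0
).fit(X_tr, y_tr)

for name, model in [("default", default), ("subsampled", subsampled), ("shallow", shallow)]:
    print(name, mean_squared_error(y_te, model.predict(X_te)))
```

On problems like this one, the subsampled and depth-limited forests often behave similarly when their parameters are matched, in line with the paper's theoretical comparison; the specific hyperparameter values above are illustrative, not recommendations.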

62G05 Nonparametric estimation
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G20 Asymptotic properties of nonparametric inference
68T05 Learning and adaptive systems in artificial intelligence
Full Text: DOI
[1] S. Arlot and R. Genuer, Analysis of Purely Random Forests Bias. Preprint (2014). · Zbl 1402.62131
[2] G. Biau, Analysis of a random forests model. J. Mach. Learn. Res. 13 (2012) 1063-1095. · Zbl 1283.62127
[3] G. Biau and L. Devroye, Cellular tree classifiers, in Algorithmic Learning Theory. Springer, Cham (2014) 8-17. · Zbl 1432.68379
[4] G. Biau, L. Devroye and G. Lugosi, Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res. 9 (2008) 2015-2033. · Zbl 1225.62081
[5] L. Breiman, Random forests. Mach. Learn. 45 (2001) 5-32. · Zbl 1007.68152
[6] L. Breiman, J.H. Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees. Chapman & Hall, CRC, Boca Raton (1984). · Zbl 0541.62042
[7] P. Bühlmann, Bagging, boosting and ensemble methods, in Handbook of Computational Statistics. Springer, Berlin, Heidelberg (2012) 985-1022.
[8] M. Denil, D. Matheson and N. de Freitas, Consistency of online random forests. Vol. 28 of Proceedings of the 30th International Conference on Machine Learning (ICML’13), Atlanta, GA, USA, June 16-21 (2013) 1256-1264.
[9] M. Denil, D. Matheson and N. de Freitas, Narrowing the gap: random forests in theory and in practice, in International Conference on Machine Learning (ICML) (2014).
[10] L. Devroye, L. Györfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition. Springer, New York (1996).
[11] R. Díaz-Uriarte and S. Alvarez de Andrés, Gene selection and classification of microarray data using random forest. BMC Bioinform. 7 (2006) 1-13.
[12] M. Fernández-Delgado, E. Cernadas, S. Barro and D. Amorim, Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 15 (2014) 3133-3181. · Zbl 1319.62005
[13] R. Genuer, Variance reduction in purely random forests. J. Nonparametric Stat. 24 (2012) 543-562. · Zbl 1254.62050
[14] R. Genuer, J. Poggi and C. Tuleau-Malot, Variable selection using random forests. Pattern Recognit. Lett. 31 (2010) 2225-2236.
[15] H. Ishwaran and U.B. Kogalur, Consistency of random survival forests. Stat. Probab. Lett. 80 (2010) 1056-1064. · Zbl 1190.62177
[16] L. Meier, S. Van de Geer and P. Bühlmann, High-dimensional additive modeling. Ann. Stat. 37 (2009) 3779-3821. · Zbl 1360.62186
[17] L. Mentch and G. Hooker, Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J. Mach. Learn. Res. 17 (2015) 841-881. · Zbl 1360.62095
[18] Y. Qi, Random forest for bioinformatics, in Ensemble Machine Learning. Springer, Boston, MA (2012) 307-323.
[19] G. Rogez, J. Rihan, S. Ramalingam, C. Orrite and P.H. Torr, Randomized trees for human pose detection, in IEEE Conference on Computer Vision and Pattern Recognition (2008) 1-8.
[20] M. Sabzevari, G. Martínez-Muñoz and A. Suárez, Improving the Robustness of Bagging with Reduced Sampling Size. Université catholique de Louvain (2014).
[21] E. Scornet, On the asymptotics of random forests. J. Multivar. Anal. 146 (2016) 72-83. · Zbl 1337.62063
[22] E. Scornet, G. Biau and J.-P. Vert, Consistency of random forests. Ann. Stat. 43 (2015) 1716-1741. · Zbl 1317.62028
[23] C.J. Stone, Optimal rates of convergence for nonparametric estimators. Ann. Stat. 8 (1980) 1348-1360. · Zbl 0451.62033
[24] C.J. Stone, Optimal global rates of convergence for nonparametric regression. Ann. Stat. 10 (1982) 1040-1053. · Zbl 0511.62048
[25] M. van der Laan, E.C. Polley and A.E. Hubbard, Super learner. Stat. Appl. Genet. Mol. Biol. 6 (2007).
[26] S. Wager, Asymptotic Theory for Random Forests. Preprint (2014).
[27] S. Wager and S. Athey, Estimation and inference of heterogeneous treatment effects using random forests. J. Am. Stat. Assoc. (2018) 1-15. · Zbl 1402.62056
[28] S. Wager and G. Walther, Adaptive Concentration of Regression Trees, with Application to Random Forests. Preprint (2015).
[29] F. Zaman and H. Hirose, Effect of subsampling rate on subbagging and related ensembles of stable classifiers, in International Conference on Pattern Recognition and Machine Intelligence. Springer (2009) 44-49.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.