
A random forest guided tour. (English) Zbl 1402.62133

Summary: The random forest algorithm, proposed by L. Breiman [Mach. Learn. 45, No. 1, 5–32 (2001; Zbl 1007.68152)], has been extremely successful as a general-purpose classification and regression method. The approach, which combines several randomized decision trees and aggregates their predictions by averaging, has shown excellent performance in settings where the number of variables is much larger than the number of observations. Moreover, it is versatile enough to be applied to large-scale problems, is easily adapted to various ad hoc learning tasks, and returns measures of variable importance. The present article reviews the most recent theoretical and methodological developments for random forests. Emphasis is placed on the mathematical forces driving the algorithm, with special attention given to the selection of parameters, the resampling mechanism, and variable importance measures. This review is intended to provide non-experts easy access to the main ideas.
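
To make the aggregation mechanism concrete, here is a minimal sketch (an editorial illustration, not code from the paper): it grows an ensemble of randomized trees on bootstrap resamples using scikit-learn, checks that the forest prediction is the average of the individual tree predictions, and reads off one variable importance measure. The synthetic Friedman data set and all parameter values are assumptions chosen for illustration.

```python
# Minimal random forest regression sketch; assumes scikit-learn and NumPy.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative choice, not from the paper).
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(
    n_estimators=200,     # number of randomized trees in the ensemble
    max_features="sqrt",  # features tried per split (the "mtry" parameter)
    bootstrap=True,       # grow each tree on a bootstrap resample
    random_state=0,
)
forest.fit(X_train, y_train)

# The forest prediction is the average of the individual tree predictions.
tree_preds = np.stack([tree.predict(X_test) for tree in forest.estimators_])
assert np.allclose(tree_preds.mean(axis=0), forest.predict(X_test))

# One of the variable importance measures reviewed in the paper
# (mean decrease in impurity).
print(forest.feature_importances_)
```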

MSC:

62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G08 Nonparametric regression and quantile regression
62G09 Nonparametric statistical resampling methods
68T05 Learning and adaptive systems in artificial intelligence
68R10 Graph theory (including graph drawing) in computer science
62-02 Research exposition (monographs, survey articles) pertaining to statistics

Citations:

Zbl 1007.68152

References:

[1] Amaratunga D, Cabrera J, Lee Y-S (2008) Enriched random forests. Bioinformatics 24:2010-2014 · doi:10.1093/bioinformatics/btn356
[2] Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput 9:1545-1588 · doi:10.1162/neco.1997.9.7.1545
[3] Archer KJ, Kimes RV (2008) Empirical characterization of random forest variable importance measures. Comput Stat Data Anal 52:2249-2260 · Zbl 1452.62027 · doi:10.1016/j.csda.2007.08.015
[4] Arlot S, Genuer R (2014) Analysis of purely random forests bias. arXiv:1407.3939 · Zbl 1402.62131
[5] Auret L, Aldrich C (2011) Empirical comparison of tree ensemble variable importance measures. Chemom Intell Lab Syst 105:157-170 · doi:10.1016/j.chemolab.2010.12.004
[6] Bai Z-H, Devroye L, Hwang H-K, Tsai T-H (2005) Maxima in hypercubes. Random Struct Algorithms 27:290-309 · Zbl 1080.60007 · doi:10.1002/rsa.20053
[7] Banerjee M, McKeague IW (2007) Confidence sets for split points in decision trees. Ann Stat 35:543-574 · Zbl 1117.62037 · doi:10.1214/009053606000001415
[8] Barndorff-Nielsen O, Sobel M (1966) On the distribution of the number of admissible points in a vector random sample. Theory Probab Appl 11:249-269 · Zbl 0278.60007 · doi:10.1137/1111020
[9] Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2:1-127 · Zbl 1192.68503 · doi:10.1561/2200000006
[10] Bernard S, Heutte L, Adam S (2008) Forest-RK: a new random forest induction method. In: Huang D-S, Wunsch DC, Levine DS, Jo K-H (eds) Advanced intelligent computing theories and applications. Springer, Berlin, pp 430-437 · doi:10.1007/978-3-540-85984-0_52
[11] Bernard S, Adam S, Heutte L (2012) Dynamic random forests. Pattern Recognit Lett 33:1580-1586 · doi:10.1016/j.patrec.2012.04.003
[12] Biau G (2012) Analysis of a random forests model. J Mach Learn Res 13:1063-1095 · Zbl 1283.62127
[13] Biau G, Devroye L (2010) On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. J Multivar Anal 101:2499-2518 · Zbl 1198.62048 · doi:10.1016/j.jmva.2010.06.019
[14] Biau G, Devroye L (2013) Cellular tree classifiers. Electron J Stat 7:1875-1912 · Zbl 1293.62067 · doi:10.1214/13-EJS829
[15] Biau G, Devroye L, Lugosi G (2008) Consistency of random forests and other averaging classifiers. J Mach Learn Res 9:2015-2033 · Zbl 1225.62081
[16] Biau G, Cérou F, Guyader A (2010) On the rate of convergence of the bagged nearest neighbor estimate. J Mach Learn Res 11:687-712 · Zbl 1242.62025
[17] Boulesteix A-L, Janitza S, Kruppa J, König IR (2012) Overview of random forest methodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdiscip Rev Data Mining Knowl Discov 2:493-507 · doi:10.1002/widm.1072
[18] Breiman L (1996) Bagging predictors. Mach Learn 24:123-140 · Zbl 0858.68080
[19] Breiman L (2000a) Some infinity theory for predictor ensembles. Technical Report 577, University of California, Berkeley
[20] Breiman L (2000b) Randomizing outputs to increase prediction accuracy. Mach Learn 40:229-242 · Zbl 0962.68143 · doi:10.1023/A:1007682208299
[21] Breiman L (2001) Random forests. Mach Learn 45:5-32 · Zbl 1007.68152 · doi:10.1023/A:1010933404324
[22] Breiman L (2003a) Setting up, using, and understanding random forests V3.1. https://www.stat.berkeley.edu/~breiman/Using_random_forests_V3.1.pdf
[23] Breiman L (2003b) Setting up, using, and understanding random forests V4.0. https://www.stat.berkeley.edu/~breiman/Using_random_forests_v4.0.pdf
[24] Breiman L (2004) Consistency for a simple model of random forests. Technical Report 670, University of California, Berkeley
[25] Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Chapman & Hall/CRC, Boca Raton · Zbl 0541.62042
[26] Bühlmann P, Yu B (2002) Analyzing bagging. Ann Stat 30:927-961 · Zbl 1029.62037 · doi:10.1214/aos/1031689014
[27] Chen C, Liaw A, Breiman L (2004) Using random forest to learn imbalanced data. Technical Report 666, University of California, Berkeley
[28] Clémençon S, Depecker M, Vayatis N (2013) Ranking forests. J Mach Learn Res 14:39-73 · Zbl 1307.68065
[29] Clémençon S, Vayatis N (2009) Tree-based ranking methods. IEEE Trans Inform Theory 55:4316-4336 · Zbl 1262.68150 · doi:10.1109/TIT.2009.2025558
[30] Criminisi A, Shotton J, Konukoglu E (2011) Decision forests: a unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Found Trends Comput Graph Vis 7:81-227 · Zbl 1243.68235 · doi:10.1561/0600000035
[31] Crookston NL, Finley AO (2008) yaImpute: an R package for kNN imputation. J Stat Softw 23:1-16 · doi:10.18637/jss.v023.i10
[32] Cutler A, Zhao G (2001) PERT—perfect random tree ensembles. Comput Sci Stat 33:490-497
[33] Cutler DR, Edwards TC Jr, Beard KH, Cutler A, Hess KT, Gibson J, Lawler JJ (2007) Random forests for classification in ecology. Ecology 88:2783-2792 · doi:10.1890/07-0539.1
[34] Davies A, Ghahramani Z (2014) The Random Forest Kernel and creating other kernels for big data from random partitions. arXiv:1402.4293
[35] Deng H, Runger G (2012) Feature selection via regularized trees. In: The 2012 international joint conference on neural networks, pp 1-8
[36] Deng H, Runger G (2013) Gene selection with guided regularized random forest. Pattern Recognit 46:3483-3489 · doi:10.1016/j.patcog.2013.05.018
[37] Denil M, Matheson D, de Freitas N (2013) Consistency of online random forests. In: International conference on machine learning (ICML)
[38] Denil M, Matheson D, de Freitas N (2014) Narrowing the gap: random forests in theory and in practice. In: International conference on machine learning (ICML)
[39] Désir C, Bernard S, Petitjean C, Heutte L (2013) One class random forests. Pattern Recognit 46:3490-3506 · doi:10.1016/j.patcog.2013.05.022
[40] Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York · Zbl 0853.68150 · doi:10.1007/978-1-4612-0711-5
[41] Díaz-Uriarte R, Alvarez de Andrés S (2006) Gene selection and classification of microarray data using random forest. BMC Bioinform 7:1-13
[42] Dietterich TG (2000) Ensemble methods in machine learning. In: Kittler J, Roli F (eds) Multiple classifier systems. Springer, Berlin, pp 1-15
[43] Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7:1-26 · Zbl 0406.62024 · doi:10.1214/aos/1176344552
[44] Efron B (1982) The jackknife, the bootstrap and other resampling plans, vol 38. CBMS-NSF Regional Conference Series in Applied Mathematics, Philadelphia · Zbl 0496.62036
[45] Fink D, Hochachka WM, Zuckerberg B, Winkler DW, Shaby B, Munson MA, Hooker G, Riedewald M, Sheldon D, Kelling S (2010) Spatiotemporal exploratory models for broad-scale survey data. Ecol Appl 20:2131-2147 · doi:10.1890/09-1340.1
[46] Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning, 2nd edn. Springer, New York · Zbl 1273.62005
[47] Genuer R (2012) Variance reduction in purely random forests. J Nonparametr Stat 24:543-562 · Zbl 1254.62050 · doi:10.1080/10485252.2012.677843
[48] Genuer R, Poggi J-M, Tuleau-Malot C (2010) Variable selection using random forests. Pattern Recognit Lett 31:2225-2236 · doi:10.1016/j.patrec.2010.03.014
[49] Geremia E, Menze BH, Ayache N (2013) Spatially adaptive random forests. In: IEEE international symposium on biomedical imaging: from nano to macro, pp 1332-1335
[50] Geurts P, Ernst D, Wehenkel L (2006) Extremely randomized trees. Mach Learn 63:3-42 · Zbl 1110.68124 · doi:10.1007/s10994-006-6226-1
[51] Gregorutti B, Michel B, Saint Pierre P (2016) Correlation and variable importance in random forests. Stat Comput. doi:10.1007/s11222-016-9646-1 · Zbl 1505.62167
[52] Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46:389-422 · Zbl 0998.68111 · doi:10.1023/A:1012487302797
[53] Györfi L, Kohler M, Krzyżak A, Walk H (2002) A distribution-free theory of nonparametric regression. Springer, New York · Zbl 1021.62024 · doi:10.1007/b97848
[54] Ho T (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20:832-844 · doi:10.1109/34.709601
[55] Hothorn T, Hornik K, Zeileis A (2006) Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat 15:651-674 · doi:10.1198/106186006X133933
[56] Howard J, Bowles M (2012) The two most important algorithms in predictive modeling today. In: Strata Conference: Santa Clara. http://strataconf.com/strata2012/public/schedule/detail/22658
[57] Ishioka T (2013) Imputation of missing values for unsupervised data using the proximity in random forests. In: eLmL 2013, The fifth international conference on mobile, hybrid, and on-line learning, pp 30-36. International Academy, Research, and Industry Association
[58] Ishwaran H (2007) Variable importance in binary regression trees and forests. Electron J Stat 1:519-537 · Zbl 1320.62158 · doi:10.1214/07-EJS039
[59] Ishwaran H (2015) The effect of splitting on random forests. Mach Learn 99:75-118 · Zbl 1320.62015 · doi:10.1007/s10994-014-5451-2
[60] Ishwaran H, Kogalur UB (2010) Consistency of random survival forests. Stat Probab Lett 80:1056-1064 · Zbl 1190.62177 · doi:10.1016/j.spl.2010.02.020
[61] Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS (2008) Random survival forests. Ann Appl Stat 2:841-860 · Zbl 1149.62331 · doi:10.1214/08-AOAS169
[62] Ishwaran H, Kogalur UB, Chen X, Minn AJ (2011) Random survival forests for high-dimensional data. Stat Anal Data Mining ASA Data Sci J 4:115-132 · Zbl 07260271 · doi:10.1002/sam.10103
[63] Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51:107-113
[64] Joly A, Geurts P, Wehenkel L (2014) Random forests with random projections of the output space for high dimensional multi-label classification. In: Calders T, Esposito F, Hüllermeier E, Meo R (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, pp 607-622
[65] Kim H, Loh W-Y (2001) Classification trees with unbiased multiway splits. J Am Stat Assoc 96:589-604 · doi:10.1198/016214501753168271
[66] Kleiner A, Talwalkar A, Sarkar P, Jordan MI (2014) A scalable bootstrap for massive data. J Royal Stat Soc Ser B (Stat Methodol) 76:795-816 · Zbl 07555464 · doi:10.1111/rssb.12050
[67] Konukoglu E, Ganz M (2014) Approximate false positive rate control in selection frequency for random forest. arXiv:1410.2838
[68] Kruppa J, Schwarz A, Arminger G, Ziegler A (2013) Consumer credit risk: individual probability estimates using machine learning. Expert Syst Appl 40:5125-5131 · doi:10.1016/j.eswa.2013.03.019
[69] Kruppa J, Liu Y, Biau G, Kohler M, König IR, Malley JD, Ziegler A (2014a) Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biometr J 56:534-563 · Zbl 1441.62404 · doi:10.1002/bimj.201300068
[70] Kruppa J, Liu Y, Diener H-C, Holste T, Weimar C, König IR, Ziegler A (2014b) Probability estimation with machine learning methods for dichotomous and multicategory outcome: applications. Biometr J 56:564-583 · Zbl 1441.62405 · doi:10.1002/bimj.201300077
[71] Kuhn M, Johnson K (2013) Applied predictive modeling. Springer, New York · Zbl 1306.62014 · doi:10.1007/978-1-4614-6849-3
[72] Kyrillidis A, Zouzias A (2014) Non-uniform feature sampling for decision tree ensembles. In: IEEE international conference on acoustics, speech and signal processing, pp 4548-4552
[73] Lakshminarayanan B, Roy DM, Teh YW (2014) Mondrian forests: efficient online random forests. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems, pp 3140-3148
[74] Latinne P, Debeir O, Decaestecker C (2001) Limiting the number of trees in random forests. In: Kittler J, Roli F (eds) Multiple classifier systems. Springer, Berlin, pp 178-187 · Zbl 0987.68896 · doi:10.1007/3-540-48219-9_18
[75] Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2:18-22
[76] Lin Y, Jeon Y (2006) Random forests and adaptive nearest neighbors. J Am Stat Assoc 101:578-590 · Zbl 1119.62304 · doi:10.1198/016214505000001230
[77] Louppe G, Wehenkel L, Sutera A, Geurts P (2013) Understanding variable importances in forests of randomized trees. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems, pp 431-439
[78] Malley JD, Kruppa J, Dasgupta A, Malley KG, Ziegler A (2012) Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inform Med 51:74-81 · doi:10.3414/ME00-01-0052
[79] Meinshausen N (2006) Quantile regression forests. J Mach Learn Res 7:983-999 · Zbl 1222.68262
[80] Meinshausen N (2009) Forest Garrote. Electron J Stat 3:1288-1304 · Zbl 1326.62093 · doi:10.1214/09-EJS434
[81] Mentch L, Hooker G (2014) A novel test for additivity in supervised ensemble learners. arXiv:1406.1845
[82] Mentch L, Hooker G (2015) Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. J Mach Learn Res (in press) · Zbl 1360.62095
[83] Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA (2011) On oblique random forests. In: Gunopulos D, Hofmann T, Malerba D, Vazirgiannis M (eds) Machine learning and knowledge discovery in databases. Springer, Berlin, pp 453-469
[84] Nadaraya EA (1964) On estimating regression. Theory Probab Appl 9:141-142 · Zbl 0136.40902 · doi:10.1137/1109020
[85] Nicodemus KK, Malley JD (2009) Predictor correlation impacts machine learning algorithms: Implications for genomic studies. Bioinformatics 25:1884-1890 · doi:10.1093/bioinformatics/btp331
[86] Politis DN, Romano JP, Wolf M (1999) Subsampling. Springer, New York · Zbl 0931.62035 · doi:10.1007/978-1-4612-1554-7
[87] Prasad AM, Iverson LR, Liaw A (2006) Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9:181-199 · doi:10.1007/s10021-005-0054-1
[88] Qian SS, King RS, Richardson CJ (2003) Two statistical methods for the detection of environmental thresholds. Ecol Model 166:87-97 · doi:10.1016/S0304-3800(03)00097-8
[89] Rieger A, Hothorn T, Strobl C (2010) Random forests with missing values in the covariates. Technical Report 79, University of Munich, Munich
[90] Saffari A, Leistner C, Santner J, Godec M, Bischof H (2009) On-line random forests. In: IEEE 12th international conference on computer vision workshops, pp 1393-1400
[91] Schwarz DF, König IR, Ziegler A (2010) On safari to random jungle: a fast implementation of random forests for high-dimensional data. Bioinformatics 26:1752-1758 · doi:10.1093/bioinformatics/btq257
[92] Scornet E (2015a) On the asymptotics of random forests. J Multivar Anal 146:72-83 · Zbl 1337.62063
[93] Scornet E (2015b) Random forests and kernel methods. IEEE Trans Inform Theory 62:1485-1500 · Zbl 1359.94969
[94] Scornet E, Biau G, Vert J-P (2015) Consistency of random forests. Ann Stat 43:1716-1741 · Zbl 1317.62028 · doi:10.1214/15-AOS1321
[95] Segal MR (1988) Regression trees for censored data. Biometrics 44:35-47 · Zbl 0707.62224 · doi:10.2307/2531894
[96] Shotton J, Fitzgibbon A, Cook M, Sharp T, Finocchio M, Moore R, Kipman A, Blake A (2011) Real-time human pose recognition in parts from single depth images. In: IEEE conference on computer vision and pattern recognition, pp 1297-1304
[97] Stone CJ (1977) Consistent nonparametric regression. Ann Stat 5:595-645 · Zbl 0366.62051 · doi:10.1214/aos/1176343886
[98] Stone CJ (1980) Optimal rates of convergence for nonparametric estimators. Ann Stat 8:1348-1360 · Zbl 0451.62033 · doi:10.1214/aos/1176345206
[99] Stone CJ (1982) Optimal global rates of convergence for nonparametric regression. Ann Stat 10:1040-1053 · Zbl 0511.62048 · doi:10.1214/aos/1176345969
[100] Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008) Conditional variable importance for random forests. BMC Bioinform 9:307 · doi:10.1186/1471-2105-9-307
[101] Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP (2003) Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inform Comput Sci 43:1947-1958 · doi:10.1021/ci034160g
[102] Toloşi L, Lengauer T (2011) Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics 27:1986-1994 · Zbl 1235.93089 · doi:10.1093/bioinformatics/btr300
[103] Truong AKY (2009) Fast growing and interpretable oblique trees via logistic regression models. PhD thesis, University of Oxford, Oxford
[104] Varian H (2014) Big data: new tricks for econometrics. J Econ Perspect 28:3-28 · doi:10.1257/jep.28.2.3
[105] Wager S (2014) Asymptotic theory for random forests. arXiv:1405.0352
[106] Wager S, Hastie T, Efron B (2014) Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 15:1625-1651 · Zbl 1319.62132
[107] Watson GS (1964) Smooth regression analysis. Sankhyā Ser A 26:359-372 · Zbl 0137.13002
[108] Welbl J (2014) Casting random forests as artificial neural networks and profiting from it. In: Jiang X, Hornegger J, Koch R (eds) Pattern recognition. Springer, Berlin, pp 765-771
[109] Winham SJ, Freimuth RR, Biernacka JM (2013) A weighted random forests approach to improve predictive performance. Stat Anal Data Mining ASA Data Sci J 6:496-505 · Zbl 1281.62238 · doi:10.1002/sam.11196
[110] Yan D, Chen A, Jordan MI (2013) Cluster forests. Comput Stat Data Anal 66:178-192 · Zbl 1471.62225 · doi:10.1016/j.csda.2013.04.010
[111] Yang F, Wang J, Fan G (2010) Kernel induced random survival forests. arXiv:1008.3952
[112] Yi Z, Soatto S, Dewan M, Zhan Y (2012) Information forests. In: 2012 information theory and applications workshop, pp 143-146
[113] Zhu R, Zeng D, Kosorok MR (2015) Reinforcement learning trees. J Am Stat Assoc 110(512):1770-1784 · Zbl 1374.68466
[114] Ziegler A, König IR (2014) Mining data with random forests: current options for real-world applications. Wiley Interdiscip Rev Data Mining Knowl Discov 4:55-63 · doi:10.1002/widm.1114
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect match.