Random forests, decision trees, and categorical predictors: the “absent levels” problem.

*(English)*Zbl 06982336Summary: One advantage of decision tree based methods like random forests is their ability to natively handle categorical predictors without having to first transform them (e.g., by using feature engineering techniques). However, in this paper, we show how this capability can lead to an inherent “absent levels” problem for decision tree based methods that has never been thoroughly discussed, and whose consequences have never been carefully explored. This problem occurs whenever there is an indeterminacy over how to handle an observation that has reached a categorical split which was determined when the observation in question’s level was absent during training. Although these incidents may appear to be innocuous, by using Leo Breiman and Adele Cutler’s random forests FORTRAN code and the randomForestR package [A. Liaw and M. Wiener, “Classification and regression by randomForest”, R News 2, No. 3, 18–22 (2002)] as motivating case studies, we examine how overlooking the absent levels problem can systematically bias a model. Furthermore, by using three real data examples, we illustrate how
absent levels can dramatically alter a model’s performance in practice, and we empirically demonstrate how some simple heuristics can be used to help mitigate the effects of the absent levels problem until a more robust theoretical solution is found.

Reviewer: Reviewer (Berlin)

##### MSC:

68Q25 | Analysis of algorithms and problem complexity |

62H30 | Classification and discrimination; cluster analysis (statistical aspects) |

62J02 | General nonlinear regression |

##### Software:

C4.5; CRAN; ElemStatLearn; gbm; GitHub; partykit; Quantregforest; randomForest; rpart; Rvoteview; Scikit; UCI-ml
PDF
BibTeX
XML
Cite

\textit{T. C. Au}, J. Mach. Learn. Res. 19, Paper No. 45, 30 p. (2018; Zbl 06982336)

Full Text:
Link

##### References:

[1] | Yali Amit and Donald Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9:1545–1588, 1997. |

[2] | G´erard Biau. Analysis of a random forests model. Journal of Machine Learning Research, 13(1):1063–1095, 2012. · Zbl 1283.62127 |

[3] | G´erard Biau, Luc Devroye, and G´abor Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9(1):2015–2033, 2008. · Zbl 1225.62081 |

[4] | Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996a. · Zbl 0858.68080 |

[5] | Leo Breiman. Out-of-bag estimation. Technical report, Department of Statistics, U.C. Berkeley, 1996b.URL https://www.stat.berkeley.edu/ breiman/OOBestimation. pdf. · Zbl 0849.68095 |

[6] | Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. · Zbl 1007.68152 |

[7] | Leo Breiman. Manual—setting up, using, and understanding random forest v4.0. 2003. URL https://www.stat.berkeley.edu/ breiman/Using_random_forests_v4.0.pdf. 28 |

[8] | Leo Breiman, Jerome. H. Friedman, Richard. A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth, 1984. ISBN 0-534-98053-8. · Zbl 0541.62042 |

[9] | Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46, 1960. |

[10] | Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In ICML ’06: Proceedings of the 23rd international conference on Machine learning, pages 233–240, 2006. |

[11] | Misha Denil, David Matheson, and Nando de Freitas. Narrowing the gap: Random forests in theory and in practice. In Proceedings of the 31th International Conference on Machine Learning, pages 665–673, 2014. |

[12] | Walter D. Fisher. On grouping for maximum homogeneity. Journal of the American Statistical Association, 53(284):789–798, 1958. · Zbl 0084.35904 |

[13] | Trevor J Hastie, Robert J. Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer, New York, 2009. · Zbl 1273.62005 |

[14] | Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998. |

[15] | Torsten Hothorn and Achim Zeileis. partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16:3905–3909, 2015. · Zbl 1351.62005 |

[16] | Torsten Hothorn, Kurt Hornik, and Achim Zeileis. Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15 (3):651–674, 2006. |

[17] | Hemant Ishwaran, Udaya B. Kogalur, Eugene H. Blackstone, and Michael S. Lauer. Random survival forests. Annals of Applied Statistics, 2(3):841–860, 2008. · Zbl 1149.62331 |

[18] | Jeff Lewis. Rvoteview: Voteview Data in R, 2015. https://github.com/JeffreyBLewis/ Rvoteview, http://voteview.polisci.ucla.edu, http://voteview.com. |

[19] | Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R News, 2(3):18–22, 2002. |

[20] | Moshe Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci. edu/ml. |

[21] | Wei-Yin Loh and Nunta Vanichsetakul. Tree-structured classification via generalized discriminant analysis. Journal of the American Statistical Association, 83(403):715–725, 1988. · Zbl 0649.62055 |

[22] | Nolan M. McCarty, Keith T. Poole, and Howard Rosenthal. Income Redistribution and the Realignment of American Politics. AEI Press, publisher for the American Enterprise Institute, 1997. 29 |

[23] | Nicolai Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006. · Zbl 1222.68262 |

[24] | Nicolai Meinshausen. quantregForest: Quantile Regression Forests, 2012. URL http:// CRAN.R-project.org/package=quantregForest. R package version 0.2-3. |

[25] | Fabian Pedregosa, Ga¨el Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and ´Edouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. · Zbl 1280.68189 |

[26] | John R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., 1993. |

[27] | Greg Ridgeway. gbm: Generalized Boosted Regression Models, 2013. URL http://CRAN. R-project.org/package=gbm. R package version 2.1. |

[28] | Brian D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996. ISBN 0-521-46086-7. · Zbl 1163.62047 |

[29] | Maytal Saar-Tsechansky and Foster Provost. Handling missing values when applying classification models. Journal of Machine Learning Research, 8:1623–1657, 2007. · Zbl 1222.68295 |

[30] | Terry Therneau, Beth Atkinson, and Brian Ripley. rpart: Recursive Partitioning and Regression Trees, 2015. URL http://CRAN.R-project.org/package=rpart. R package version 4.1-10. |

[31] | Stefan Wager, Trevor Hastie, and Bradley Efron. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. Journal of Machine Learning Research, 15 (1):1625–1651, 2014. · Zbl 1319.62132 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.