Loh, Wei-Yin
Improving the precision of classification trees. (English) Zbl 1184.62109
Ann. Appl. Stat. 3, No. 4, 1710-1737 (2009).

Summary: Besides serving as prediction models, classification trees are useful for finding important predictor variables and identifying interesting subgroups in the data. These functions can be compromised by weak split selection algorithms that have variable selection biases or that fail to search beyond local main effects at each node of the tree. The resulting models may include many irrelevant variables or select too few of the important ones. Either eventuality can lead to erroneous conclusions. Four techniques to improve the precision of the models are proposed and their effectiveness compared with that of other algorithms, including tree ensembles, on real and simulated data sets.

Cited in 31 Documents

MSC:
62H30 Classification and discrimination; cluster analysis (statistical aspects)
05C90 Applications of graph theory
65C60 Computational problems in statistics (MSC2010)

Keywords: bagging; kernel density; discrimination; nearest neighbor; prediction; random forest; selection bias; variable selection

Software: SAS/STAT; SAS; C4.5; THAID; Stata; rpart; randomForest
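The summary's concern is variable selection bias in greedy, exhaustive-search split selection. As an illustration only (a minimal sketch, not code from the paper or its software), the following Python example shows the effect: a pure-noise continuous predictor, which offers many candidate split points, is chosen at the root of a greedily grown tree far more often than a pure-noise binary predictor, even though neither is related to the class label. The simulation setup is hypothetical, and scikit-learn's DecisionTreeClassifier stands in for a generic exhaustive-search tree.

# Hypothetical illustration of selection bias in exhaustive-search split selection.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n, reps = 200, 500
root_counts = {"continuous noise": 0, "binary noise": 0}

for _ in range(reps):
    y = rng.integers(0, 2, n)                     # random class labels, unrelated to X
    x_cont = rng.uniform(size=n)                  # noise with ~n-1 candidate split points
    x_bin = rng.integers(0, 2, n).astype(float)   # noise with a single candidate split point
    X = np.column_stack([x_cont, x_bin])
    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
    chosen = stump.tree_.feature[0]               # feature index at the root; -2 if no split
    if chosen == 0:
        root_counts["continuous noise"] += 1
    elif chosen == 1:
        root_counts["binary noise"] += 1

print(root_counts)   # the continuous noise variable dominates the root splits

Unbiased split selection methods aim to make the two predictors roughly equally likely to be selected under this kind of null simulation.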