Optimal classification trees.

*(English)* Zbl 06841416

Summary: State-of-the-art decision tree methods apply heuristics recursively to create each split in isolation, which may not capture the underlying characteristics of the dataset well. The optimal decision tree problem attempts to resolve this by creating the entire decision tree at once to achieve global optimality. In the last 25 years, algorithmic advances in integer optimization coupled with hardware improvements have resulted in an astonishing 800 billion factor speedup in mixed-integer optimization (MIO). Motivated by this speedup, we present optimal classification trees, a novel formulation of the decision tree problem using modern MIO techniques that yields the optimal decision tree for axes-aligned splits. We also show the richness of this MIO formulation by adapting it to give optimal classification trees with hyperplanes, which generate optimal decision trees with multivariate splits. Synthetic tests demonstrate that these methods recover the true decision tree more closely than heuristics, refuting the notion that optimal methods overfit the training data. We comprehensively benchmark these methods on a sample of 53 datasets from the UCI machine learning repository. We establish that these MIO methods are practically solvable on real-world datasets with sizes in the 1000s, and give average absolute improvements in out-of-sample accuracy over CART of \(1-2\%\) and \(3-5\%\) for the univariate and multivariate cases, respectively. Furthermore, we identify that optimal classification trees are likely to outperform CART by \(1.2-1.3\%\) in situations where the CART accuracy is high and we have sufficient training data, while the multivariate version outperforms CART by \(4-7\%\) when the CART accuracy or dimension of the dataset is low.
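The contrast between greedy, split-by-split construction and solving for the whole tree at once can be seen on a toy example. The sketch below is not the paper's MIO formulation; it replaces the MIO solver with a brute-force search over all depth-2 axis-aligned trees on an XOR-style dataset, where no single root split is informative but a globally chosen depth-2 tree classifies perfectly. All names and the dataset are illustrative assumptions.

```python
from itertools import product

# Toy XOR dataset: no single axis-aligned split separates the two classes,
# so a greedy root split is uninformative, yet a depth-2 tree is perfect.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]

def leaf_correct(idx):
    """Number of samples in idx matched by a majority-class leaf."""
    labels = [y[i] for i in idx]
    if not labels:
        return 0
    return max(labels.count(c) for c in set(labels))

def tree_correct(rf, rt, lf, lt, gf, gt):
    """Correct predictions of a depth-2 tree with root split (rf, rt),
    left-child split (lf, lt) and right-child split (gf, gt)."""
    n = range(len(X))
    left = [i for i in n if X[i][rf] <= rt]
    right = [i for i in n if X[i][rf] > rt]
    correct = 0
    for side, (f, t) in ((left, (lf, lt)), (right, (gf, gt))):
        correct += leaf_correct([i for i in side if X[i][f] <= t])
        correct += leaf_correct([i for i in side if X[i][f] > t])
    return correct

# Exhaustive search over every depth-2 axis-aligned tree -- a tiny stand-in
# for the global search that the MIO formulation carries out at scale.
thresholds = [0.5]  # midpoints between the distinct feature values
best = max(tree_correct(*cfg)
           for cfg in product([0, 1], thresholds, repeat=3))
print(best)  # 4 of 4 correct: an optimal depth-2 tree solves XOR exactly
```

On this dataset any single split leaves at most 2 of 4 points correct, so an impurity-driven greedy method has no signal at the root; choosing all three splits jointly recovers the perfect tree. The paper's contribution is making this joint search tractable via MIO rather than enumeration.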

##### MSC:

68T05 | Learning and adaptive systems in artificial intelligence |


\textit{D. Bertsimas} and \textit{J. Dunn}, Mach. Learn. 106, No. 7, 1039--1082 (2017; Zbl 06841416)


##### References:

[1] | Arthanari, T., & Dodge, Y. (1981). Mathematical programming in statistics (Vol. 341). New York: Wiley. · Zbl 0549.62002 |

[2] | Auer, P., Holte, R. C., & Maass, W. (1995). Theory and applications of agnostic PAC-learning with small decision trees. In Proceedings of the 12th international conference on machine learning (pp. 21-29). |

[3] | Bennett, K. P. (1992). Decision tree construction via linear programming. In M. Evans (Ed.), Proceedings of the 4th midwest artificial intelligence and cognitive science society conference (pp. 97-101). |

[4] | Bennett, K. P., & Blue, J. (1996). Optimal decision trees. Rensselaer Polytechnic Institute Math Report No. 214. |

[5] | Bennett, K. P., & Blue, J. A. (1998). A support vector machine approach to decision trees. In IEEE international joint conference on neural networks proceedings. IEEE world congress on computational intelligence (Vol. 3, pp. 2396-2401). |

[6] | Bertsimas, D; King, A, An algorithmic approach to linear regression, Operations Research, 64, 2-16, (2015) · Zbl 1338.90272 |

[7] | Bertsimas, D., & King, A. (2017). Logistic regression: From art to science. Statistical Science (to appear). · Zbl 1442.62166 |

[8] | Bertsimas, D; Mazumder, R, Least quantile regression via modern optimization, The Annals of Statistics, 42, 2494-2525, (2014) · Zbl 1302.62154 |

[9] | Bertsimas, D; Shioda, R, Classification and regression via integer optimization, Operations Research, 55, 252-271, (2007) · Zbl 1167.90593 |

[10] | Bertsimas, D., & Weismantel, R. (2005). Optimization over integers. Belmont, MA: Dynamic Ideas. |

[11] | Bertsimas, D; King, A; Mazumder, R, Best subset selection via a modern optimization Lens, Annals of Statistics, 44, 813-852, (2016) · Zbl 1335.62115 |

[12] | Bezanson, J., Edelman, A., Karpinski, S., & Shah, V. B. (2014). Julia: A fresh approach to numerical computing. arXiv preprint arXiv:1411.1607 · Zbl 1356.68030 |

[13] | Bixby, R. E. (2012). A brief history of linear and mixed-integer programming computation. Documenta Mathematica, Extra Volume: Optimization Stories, 107-121. · Zbl 1270.90003 |

[14] | Breiman, L, Random forests, Machine Learning, 45, 5-32, (2001) · Zbl 1007.68152 |

[15] | Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks. · Zbl 0541.62042 |

[16] | Cox, LA; Yuping, Q; Kuehner, W, Heuristic least-cost computation of discrete classification functions with uncertain argument values, Annals of Operations Research, 21, 1-29, (1989) · Zbl 0705.90089 |

[17] | Esmeir, S; Markovitch, S, Anytime learning of decision trees, The Journal of Machine Learning Research, 8, 891-933, (2007) · Zbl 1222.68194 |

[18] | Gurobi Optimization Inc. (2015a). Gurobi 6.0 performance benchmarks. http://www.gurobi.com/pdfs/benchmarks.pdf. Accessed September 5, 2015. |

[19] | Gurobi Optimization Inc. (2015b). Gurobi optimizer reference manual. http://www.gurobi.com. |

[20] | Heath, D., Kasif, S., & Salzberg, S. (1993). Induction of oblique decision trees. In IJCAI, Citeseer (pp. 1002-1007). |

[21] | Hyafil, L; Rivest, RL, Constructing optimal binary decision trees is NP-complete, Information Processing Letters, 5, 15-17, (1976) · Zbl 0333.68029 |

[22] | IBM ILOG CPLEX. (2014). V12.1 user's manual. https://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/. |

[23] | Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, \(2\)(3), 18-22. http://CRAN.R-project.org/doc/Rnews/. |

[24] | Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml. |

[25] | Loh, WY; Shih, YS, Split selection methods for classification trees, Statistica Sinica, 7, 815-840, (1997) · Zbl 1067.62545 |

[26] | López-Chau, A; Cervantes, J; López-García, L; Lamont, FG, Fisher's decision tree, Expert Systems with Applications, 40, 6283-6291, (2013) |

[27] | Lubin, M; Dunning, I, Computing in operations research using Julia, INFORMS Journal on Computing, 27, 238-248, (2015) · Zbl 1331.90001 |

[28] | Murthy, S., & Salzberg, S. (1995a). Lookahead and pathology in decision tree induction. In IJCAI, Citeseer (pp. 1025-1033). |

[29] | Murthy, S. K., & Salzberg, S. (1995b). Decision tree induction: How effective is the greedy heuristic? In KDD (pp. 222-227). |

[30] | Murthy, SK; Kasif, S; Salzberg, S, A system for induction of oblique decision trees, Journal of Artificial Intelligence Research, 2, 1-32, (1994) · Zbl 0900.68335 |

[31] | Nemhauser, G. L. (2013). Integer programming: The global impact. In Presented at EURO, INFORMS, Rome, Italy, 2013. http://euro-informs2013.org/data/http_/euro2013.org/wp-content/uploads/nemhauser.pdf. Accessed September 9, 2015. |

[32] | Norouzi, M., Collins, M. D., Johnson, M. A., Fleet, D. J., & Kohli, P. (2015). Efficient non-greedy optimization of decision trees. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Proceedings of the Advances in Neural Information Processing Systems 2015, 28: Annual Conference on Neural Information Processing Systems, 7-12 December 2015, Montreal, QC, pp. 1729-1737. |

[33] | Norton, S. W. (1989). Generating better decision trees. In IJCAI (Vol. 89, pp. 800-805). · Zbl 0709.68085 |

[34] | Payne, HJ; Meisel, WS, An algorithm for constructing optimal binary decision trees, IEEE Transactions on Computers, 100, 905-916, (1977) · Zbl 0363.68131 |

[35] | Quinlan, JR, Induction of decision trees, Machine Learning, 1, 81-106, (1986) |

[36] | Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Francisco, CA: Morgan Kaufmann. |

[37] | R Core Team. (2015). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. http://www.R-project.org/. |

[38] | Son, NH, From optimal hyperplanes to optimal decision trees, Fundamenta Informaticae, 34, 145-174, (1998) · Zbl 0903.68161 |

[39] | Therneau, T., Atkinson, B., & Ripley, B. (2015). rpart: Recursive partitioning and regression trees. http://CRAN.R-project.org/package=rpart, R package version 4.1-9. |

[40] | Tjortjis, C., & Keane, J. (2002). T3: A classification algorithm for data mining. Lecture Notes in Computer Science (Vol. 2412, pp. 50-55). Berlin: Springer. · Zbl 1020.68916 |

[41] | Top500 Supercomputer Sites. (2015). Performance development. http://www.top500.org/statistics/perfdevel/. Accessed September 4, 2015. |

[42] | Truong, A. (2009). Fast growing and interpretable oblique trees via logistic regression models. Ph.D. thesis, University of Oxford. |

[43] | Tzirakis, P., & Tjortjis, C. (2016). T3C: Improving a decision tree classification algorithm's interval splits on continuous attributes. Advances in Data Analysis and Classification, 1-18. |

[44] | Wickramarachchi, D; Robertson, B; Reale, M; Price, C; Brown, J, HHCART: an oblique decision tree, Computational Statistics & Data Analysis, 96, 12-23, (2016) · Zbl 06918560 |
