Confidence sets for split points in decision trees. (English) Zbl 1117.62037

Summary: We investigate the problem of finding confidence sets for split points in decision trees (CART). Our main results establish the asymptotic distribution of the least squares estimators and some associated residual sum of squares statistics in a binary decision tree approximation to a smooth regression curve. Cube-root asymptotics with non-normal limit distributions are involved. We study various confidence sets for the split point, one calibrated using the subsampling bootstrap, and others calibrated using plug-in estimates of some nuisance parameters. The performance of the confidence sets is assessed in a simulation study. A motivation for developing such confidence sets comes from the problem of phosphorus pollution in the Everglades. Ecologists have suggested that split points provide a phosphorus threshold at which biological imbalance occurs, and the lower endpoint of the confidence set may be interpreted as a level that is protective of the ecosystem. This is illustrated using data from a Duke University Wetlands Center phosphorus dosing study in the Everglades.
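As a rough illustration of the ideas in the summary, the sketch below fits a one-split least squares approximation to (x, y) data and calibrates a confidence interval for the split point by subsampling, using the cube-root rate mentioned above. It is not the authors' code: the function names `fit_split` and `subsampling_ci`, the subsample size `b`, the number of subsamples `n_sub`, and the crude empirical quantile rule are all assumptions made for illustration. The subsample size should be small relative to the sample size n.

```python
import random


def fit_split(x, y):
    """Least squares split point d of a one-split fit.

    The fit is mean(y | x <= d) on the left and mean(y | x > d) on the right.
    Returns None if all x values are tied (no split is possible).
    """
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total = sum(v for _, v in pairs)
    best_gain, best_d = -1.0, None
    left_sum = 0.0
    for k in range(1, n):  # k = number of points in the left group
        left_sum += pairs[k - 1][1]
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between tied x values
        right_sum = total - left_sum
        # RSS = sum(y^2) - gain, so minimising RSS means maximising gain.
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k)
        if gain > best_gain:
            best_gain, best_d = gain, pairs[k - 1][0]
    return best_d


def subsampling_ci(x, y, b, n_sub=500, alpha=0.05, seed=None):
    """Subsampling interval for the split point, assuming an n^(1/3) rate.

    Returns (lower, estimate, upper). Hypothetical helper: b, n_sub and the
    quantile rule are illustrative choices, not taken from the paper.
    """
    rng = random.Random(seed)
    n = len(x)
    d_hat = fit_split(x, y)
    roots = []
    for _ in range(n_sub):
        idx = rng.sample(range(n), b)  # subsample without replacement
        d_sub = fit_split([x[i] for i in idx], [y[i] for i in idx])
        roots.append(b ** (1 / 3) * (d_sub - d_hat))
    roots.sort()
    lo_q = roots[int((alpha / 2) * (n_sub - 1))]
    hi_q = roots[int((1 - alpha / 2) * (n_sub - 1))]
    scale = n ** (1 / 3)
    return d_hat - hi_q / scale, d_hat, d_hat - lo_q / scale
```

Calling `subsampling_ci(x, y, b=80)` on two equal-length lists of numbers returns the lower endpoint, the point estimate, and the upper endpoint. In an application like the one described in the summary, the lower endpoint is the quantity that would be read as a protective threshold.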

MSC:

62G08 Nonparametric regression and quantile regression
62E20 Asymptotic distribution theory in statistics
62G15 Nonparametric tolerance and confidence regions
62P12 Applications of statistics to environmental and related topics
62G20 Asymptotic properties of nonparametric inference
