Bandwidth choice for nonparametric classification. (English) Zbl 1064.62075

Summary: It is shown that, for kernel-based classification with univariate distributions and two populations, optimal bandwidth choice has a dichotomous character. If the two densities cross at just one point, where their curvatures have the same signs, then minimum Bayes risk is achieved using bandwidths which are an order of magnitude larger than those which minimize pointwise estimation error. On the other hand, if the curvature signs are different, or if there are multiple crossing points, then bandwidths of conventional size are generally appropriate.
The range of different modes of behavior is narrower in multivariate settings. There, the optimal bandwidth is generally of the same size as that appropriate for pointwise density estimation. These properties motivate empirical rules for bandwidth choice.
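The classification rule discussed above can be sketched as follows: each observation is assigned to the population whose kernel density estimate is larger at that point, so the bandwidth h enters the decision rule directly. This is a minimal illustrative sketch, not the authors' implementation; the Gaussian kernel, equal priors, the normal training samples, and the particular bandwidth are all assumptions made here for the example.

```python
import numpy as np

def kde(x, sample, h):
    """Gaussian kernel density estimate at the points x, from a 1-D sample."""
    u = (x[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

def classify(x, sample0, sample1, h):
    """Assign each point of x to the population with the larger estimated
    density (equal prior probabilities assumed)."""
    return (kde(x, sample1, h) > kde(x, sample0, h)).astype(int)

# Two univariate populations whose densities cross at a single point
# (illustrative data; seed and parameters are arbitrary choices).
rng = np.random.default_rng(0)
sample0 = rng.normal(0.0, 1.0, 200)
sample1 = rng.normal(2.0, 1.0, 200)

x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0])
labels = classify(x, sample0, sample1, h=0.5)
```

Varying `h` in this sketch changes the estimated crossing point and hence the empirical Bayes risk, which is the quantity whose optimal bandwidth order the paper characterizes.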


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62G07 Density estimation
62C12 Empirical decision procedures; empirical Bayes procedures
62F15 Bayesian inference

