
Improved nearest neighbor classifiers by weighting and selection of predictors. (English) Zbl 1505.62404

Summary: Nearest neighbor classification is a flexible classification method that works under weak assumptions. The basic concept is to use weighted or unweighted sums over class indicators of the observations in the neighborhood of the target observation. Two modifications that improve performance are considered here. First, instead of using weights that are determined solely by the distances, we estimate the weights with a logit model; a selection procedure such as the lasso or boosting then automatically selects the relevant nearest neighbors. Building on this concept of estimation and selection, in a second step we extend the predictor space: we include not only nearest neighbor counts but also the original predictors themselves and nearest neighbor counts based on distances in subdimensions of the predictor space. The resulting classifiers combine the strength of nearest neighbor methods with that of parametric approaches and, through the use of subdimensions, are able to select the relevant features. Simulations and real data sets demonstrate that the method yields lower misclassification rates than currently available nearest neighbor methods and is a strong and flexible competitor in classification problems.
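
A minimal R sketch of the basic construction, not the authors' implementation: leave-one-out nearest neighbor class counts are computed in the full predictor space and in each single dimension, the original predictors are appended, and a lasso-penalized logit model (cv.glmnet from the glmnet package cited below) selects the relevant components. The helper nn_count_features, the simulated data, and the choice k_max = 10 are illustrative assumptions.

library(glmnet)  # lasso-penalized logistic regression

## Leave-one-out class-1 counts among the 1..k_max nearest neighbors,
## using the Euclidean distance restricted to the columns in `dims`.
nn_count_features <- function(X, y, dims, k_max) {
  D <- as.matrix(dist(X[, dims, drop = FALSE]))
  diag(D) <- Inf                    # exclude each point from its own neighborhood
  t(apply(D, 1, function(d) {
    nb <- order(d)[seq_len(k_max)]  # indices of the k_max nearest observations
    cumsum(y[nb] == 1)              # class-1 counts among the first 1..k neighbors
  }))
}

set.seed(1)
n <- 200; p <- 4; k_max <- 10
X <- matrix(rnorm(n * p), n, p)
y <- as.integer(X[, 1] + X[, 2] + rnorm(n) > 0)  # only dimensions 1-2 are relevant

## Extended predictor space: counts in the full space, counts in every
## single dimension, and the original predictors themselves.
Z <- cbind(nn_count_features(X, y, 1:p, k_max),
           do.call(cbind, lapply(1:p, function(j)
             nn_count_features(X, y, j, k_max))),
           X)

## Lasso logit: cross-validation chooses the penalty; the nonzero
## coefficients indicate which neighbor counts and predictors are selected.
fit <- cv.glmnet(Z, y, family = "binomial", alpha = 1)
coef(fit, s = "lambda.min")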

MSC:

62-08 Computational methods for problems pertaining to statistics
62H30 Classification and discrimination; cluster analysis (statistical aspects)
68T05 Learning and adaptive systems in artificial intelligence

References:

[1] Bache, K., Lichman, M.: UCI Machine Learning Repository. http://archive.ics.uci.edu/ml (2013)
[2] Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123-140 (1996a) · Zbl 0858.68080
[3] Breiman, L.: Heuristics of instability and stabilisation in model selection. Ann. Stat. 24, 2350-2383 (1996b) · Zbl 0867.62055
[4] Breiman, L.: Stacked regressions. Mach. Learn. 24(1), 49-64 (1996c) · Zbl 0849.68104
[5] Breiman, L.: Random forests. Mach. Learn. 45(1), 5-32 (2001) · Zbl 1007.68152
[6] Bühlmann, P., Hothorn, T.: Boosting algorithms: regularization, prediction and model fitting (with discussion). Stat. Sci. 22, 477-505 (2007) · Zbl 1246.62163
[7] Bühlmann, P., Yu, B.: Boosting with the L2 loss: regression and classification. J. Am. Stat. Assoc. 98, 324-339 (2003) · Zbl 1041.62029
[8] Candes, E., Tao, T.: The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 35(6), 2313-2351 (2007) · Zbl 1139.62019
[9] Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273-297 (1995) · Zbl 0831.68098
[10] Domeniconi, C., Peng, J., Gunopulos, D.: Locally adaptive metric nearest-neighbor classification. IEEE Trans. Pattern Anal. Mach. Intell. 24, 1281-1285 (2002)
[11] Domeniconi, C., Yan, B.: Nearest neighbor ensemble. In: Proceedings of the 17th International Conference on Pattern Recognition, vol. 1, pp. 228-231 (2004)
[12] Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348-1360 (2001) · Zbl 1073.62547
[13] Fix, E., Hodges, J.L.: Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties. US Air Force School of Aviation Medicine, Randolph Field (1951) · Zbl 0715.62080
[14] Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1-22 (2010)
[15] Friedman, J.H.: Flexible metric nearest neighbor classification. Technical report 113, Stanford University, Statistics Department (1994)
[16] Friedman, J.H., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of boosting. Ann. Stat. 28, 337-407 (2000) · Zbl 1106.62323
[17] Gertheiss, J., Tutz, G.: Feature selection and weighting by nearest neighbor ensembles. Chemom. Intell. Lab. Syst. 99, 30-38 (2009)
[18] Ghosh, A.K.: On nearest neighbor classification using adaptive choice of k. J. Comput. Graph. Stat. 16(2), 482-502 (2007)
[19] Ghosh, A.K.: A probabilistic approach for semi-supervised nearest neighbor classification. Pattern Recognit. Lett. 33(9), 1127-1133 (2012)
[20] Ghosh, A.K., Godtliebsen, F.: On hybrid classification using model assisted posterior estimates. Pattern Recognit. 45(6), 2288-2298 (2012) · Zbl 1234.68340
[21] Goeman, J.J.: penalized: L1 (lasso and fused lasso) and L2 (ridge) penalized estimation in GLMs and in the Cox model. R package version 0.9-42 (2012)
[22] Hall, P., Park, B.U., Samworth, R.J.: Choice of neighbor order in nearest-neighbor classification. Ann. Stat. 36, 2135-2152 (2008) · Zbl 1274.62421
[23] Hastie, T., Tibshirani, R.: Discriminant adaptive nearest-neighbor classification. IEEE Trans. Pattern Anal. Mach. Intell. 18, 607-616 (1996)
[24] Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning, 2nd edn. Springer, New York (2009) · Zbl 1273.62005
[25] Holmes, C., Adams, N.: A probabilistic nearest neighbour method for statistical pattern recognition. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 64(2), 295-306 (2002) · Zbl 1059.62065
[26] Holmes, C.C., Adams, N.M.: Likelihood inference in nearest-neighbour classification models. Biometrika 90(1), 99-112 (2003) · Zbl 1034.62053
[27] Hothorn, T.: TH.data: TH's data archive. R package version 1.0-3 (2014)
[28] Hothorn, T., Bühlmann, P., Kneib, T., Schmid, M., Hofner, B.: mboost: Model-based boosting. R package version 2.2-3 (2013)
[29] Leisch, F., Dimitriadou, E.: mlbench: Machine learning benchmark problems. R package version 2.1-1 (2010)
[30] Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18-22 (2002)
[31] Lin, Y., Jeon, Y.: Random forests and adaptive nearest neighbors. J. Am. Stat. Assoc. 101, 578-590 (2006) · Zbl 1119.62304
[32] Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., Leisch, F.: e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. R package version 1.6-2 (2014)
[33] Morin, R.L., Raeside, D.E.: A reappraisal of distance-weighted k-nearest neighbor classification for pattern recognition with missing data. IEEE Trans. Syst. Man Cybern. 11, 241-243 (1981)
[34] Nadaraya, E.A.: On estimating regression. Theory Probab. Appl. 10, 186-190 (1964) · Zbl 0136.40902
[35] Paik, M., Yang, Y.: Combining nearest neighbor classifiers versus cross-validation selection. Stat. Appl. Genet. Mol. Biol. 3(12), 1-19 (2004) · Zbl 1072.62111
[36] Park, M.Y., Hastie, T.: An L1 regularization-path algorithm for generalized linear models. J. R. Stat. Soc. B 69, 659-677 (2007) · Zbl 07555370
[37] Parthasarathy, G., Chatterji, B.N.: A class of new KNN methods for low sample problems. IEEE Trans. Syst. Man Cybern. 20, 715-718 (1990)
[38] Pößnecker, W.: MRSP: multinomial response models with structured penalties. R package version 0.4.2 (2014)
[39] R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2013)
[40] Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge (1996) · Zbl 0853.62046
[41] Schliep, K., Hechenbichler, K.: kknn: Weighted k-nearest neighbors. R package version 1.2-3 (2013)
[42] Silverman, B.W., Jones, M.C.: Commentary on Fix and Hodges (1951): an important contribution to nonparametric discriminant analysis and density estimation. Int. Stat. Rev. 57, 233-238 (1989) · Zbl 0715.62079
[43] Simonoff, J.S.: Smoothing Methods in Statistics. Springer, New York (1996) · Zbl 0859.62035
[44] Stone, C.J.: Consistent nonparametric regression (with discussion). Ann. Stat. 5, 595-645 (1977) · Zbl 0366.62051
[45] Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267-288 (1996) · Zbl 0850.62538
[46] Tibshirani, R., Chu, G., Narasimhan, B., Li, J.: samr: SAM: Significance analysis of microarrays. R package version 2.0 (2011)
[47] Tutz, G., Binder, H.: Generalized additive modeling with implicit variable selection by likelihood-based boosting. Biometrics 62, 961-971 (2006) · Zbl 1116.62075
[48] Tutz, G., Pössnecker, W., Uhlmann, L.: Variable selection in general multinomial logit models. Comput. Stat. Data Anal. 82, 207-222 (2015) · Zbl 1507.62170
[49] Venables, W.N., Ripley, B.D.: Modern Applied Statistics with S, 4th edn. Springer, New York (2002) · Zbl 1006.62003
[50] Watson, G.S.: Smooth regression analysis. Sankhyā, Ser. A 26, 359-372 (1964) · Zbl 0137.13002
[51] Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301-320 (2005) · Zbl 1069.62054