Multivariate and functional classification using depth and distance. (English) Zbl 1414.62247

Summary: We construct classifiers for multivariate and functional data. Our approach is based on a kind of distance between data points and classes. The distance measure needs to be robust to outliers and invariant to linear transformations of the data. For this purpose we can use the bagdistance, which is based on halfspace depth. It satisfies most of the properties of a norm but is able to reflect asymmetry when the class is skewed. Alternatively, we can compute a measure of outlyingness based on the skew-adjusted projection depth. In either case, we propose the DistSpace transform, which maps each data point to the vector of its distances to all classes, followed by \(k\)-nearest neighbor (kNN) classification of the transformed data points. This combines invariance and robustness with the simplicity and wide applicability of kNN. The proposal is compared with other methods in experiments with real and simulated data.
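
The two-step idea in the summary (map each point to its vector of distances to all classes, then run kNN on those vectors) can be illustrated with a minimal sketch. It is not the authors' implementation. As a stand-in for the bagdistance or the skew-adjusted outlyingness, it assumes a plain Stahel-Donoho outlyingness approximated over random directions, which is robust and affine invariant in the limit but does not reflect skewness. The function names `outlyingness`, `distspace` and `knn_predict`, the simulated data, and the choices of 200 directions and k = 5 are all hypothetical.

```python
import numpy as np

def outlyingness(points, reference, directions):
    """Approximate Stahel-Donoho outlyingness of `points` w.r.t. `reference`.

    Stand-in for the distances named in the summary (assumption): the maximum,
    over a finite set of directions, of |projection - median| / MAD.
    The MAD is left unscaled, which only rescales all distances by a constant.
    """
    proj_ref = reference @ directions.T            # shape (n_ref, n_dir)
    proj_pts = points @ directions.T               # shape (n_pts, n_dir)
    med = np.median(proj_ref, axis=0)
    mad = np.median(np.abs(proj_ref - med), axis=0)
    mad = np.where(mad > 0, mad, np.finfo(float).eps)  # guard against zero spread
    return np.max(np.abs(proj_pts - med) / mad, axis=1)

def distspace(points, class_samples, directions):
    """Map each point to the vector of its distances to all classes."""
    return np.column_stack(
        [outlyingness(points, ref, directions) for ref in class_samples]
    )

def knn_predict(train_feats, train_labels, test_feats, k=5):
    """Plain kNN with Euclidean distance and majority vote (integer labels)."""
    d2 = ((test_feats[:, None, :] - train_feats[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = train_labels[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

# Hypothetical toy data: two well-separated bivariate Gaussian classes.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[6, 6], scale=1.0, size=(100, 2))
X_train = np.vstack([X0, X1])
y_train = np.repeat([0, 1], 100)

# Random unit directions used to approximate the supremum over all directions.
directions = rng.normal(size=(200, 2))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Step 1: DistSpace transform of training and test points.
class_samples = [X_train[y_train == g] for g in (0, 1)]
train_feats = distspace(X_train, class_samples, directions)
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])
test_feats = distspace(X_test, class_samples, directions)

# Step 2: kNN in the transformed space.
pred = knn_predict(train_feats, y_train, test_feats, k=5)
print(pred)
```

With G classes the transformed data always have G coordinates, whatever the original dimension, which is what lets the same kNN step serve multivariate and functional inputs once a suitable distance to each class is available.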


62H30 Classification and discrimination; cluster analysis (statistical aspects)

