Multiclass classification, information, divergence and surrogate risk. (English) Zbl 1408.62115

Summary: We provide a unifying view of statistical information measures, multiway Bayesian hypothesis testing, loss functions for multiclass classification problems and multidistribution \(f\)-divergences, elaborating equivalence results between all of these objects, and extending existing results for binary outcome spaces to more general ones. We consider a generalization of \(f\)-divergences to multiple distributions, and we provide a constructive equivalence between divergences, statistical information (in the sense of DeGroot) and losses for multiclass classification. A major application of our results is in multiclass classification problems in which we must both infer a discriminant function \(\gamma\) – for predicting a label \(Y\) from a datum \(X\) – and a data representation (or, in the setting of a hypothesis testing problem, an experimental design), modeled as a quantizer \(\mathsf{q}\) from a family of possible quantizers \(\mathsf{Q}\). In this setting, we characterize when two loss functions are equivalent, meaning that optimizing either of the two losses yields the same optimal discriminant and quantizer \(\mathsf{q}\), complementing and extending earlier results of X. Nguyen et al. [ibid. 37, No. 2, 876–904 (2009; Zbl 1162.62060)] to the multiclass case. Our results provide a more substantial basis than standard classification calibration results for comparing different losses: we describe the convex losses that are consistent for jointly choosing a data representation and minimizing the (weighted) probability of error in multiclass classification problems.
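For orientation, the two central objects the summary connects admit the following standard formulations from the cited literature (reconstructed here in the usual form, e.g. Györfi and Nemetz [14] and DeGroot [9], not quoted from the paper under review). The \(f\)-dissimilarity generalizes \(f\)-divergences to \(k\) distributions, and DeGroot's statistical information measures the expected reduction in prior uncertainty:

```latex
% f-dissimilarity of k distributions (Györfi-Nemetz [14]):
% f convex, positively homogeneous; p_1,...,p_k densities of
% P_1,...,P_k with respect to a dominating measure \mu
D_f(P_1,\dots,P_k) \;=\; \int f\bigl(p_1(x),\dots,p_k(x)\bigr)\,d\mu(x)

% DeGroot statistical information [9] of an observation X about
% the label Y, for prior \pi and a concave uncertainty function H
% on the probability simplex:
I_H(X;\pi) \;=\; H(\pi) \;-\; \mathbb{E}_X\Bigl[H\bigl(\pi(\cdot\mid X)\bigr)\Bigr]
```

Taking \(H(\pi)\) to be the Bayes (minimal expected) risk of a loss under prior \(\pi\) is the usual bridge between loss functions and information measures; the paper's equivalence results operate along this correspondence in the multiclass setting.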


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62B10 Statistical aspects of information-theoretic topics
62G10 Nonparametric hypothesis testing
62K05 Optimal statistical designs
94A17 Measures of information, entropy


Zbl 1162.62060


Full Text: DOI arXiv Euclid


[1] Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B28 131–142. · Zbl 0203.19902
[2] Bartlett, P. L., Jordan, M. I. and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. J. Amer. Statist. Assoc.101 138–156. · Zbl 1118.62330 · doi:10.1198/016214505000000907
[3] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B57 289–300. · Zbl 0809.62014
[4] Berger, J. O. (2006). The case for objective Bayesian analysis. Bayesian Anal.1 385–402. · Zbl 1331.62042 · doi:10.1214/06-BA115
[5] Bernardo, J. M. (2005). Reference analysis. In Bayesian Thinking, Modeling and Computation (D. K. Dey and C. R. Rao, eds.). Handbook of Statistics25 17–90. Elsevier, Amsterdam.
[6] Blackwell, D. (1951). Comparison of experiments. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability 93–102. Univ. California Press, Berkeley, CA. · Zbl 0044.14203
[7] Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory, 2nd ed. Wiley, Hoboken, NJ. · Zbl 1140.94001
[8] Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung.2 299–318. · Zbl 0157.25802
[9] DeGroot, M. H. (1962). Uncertainty, information, and sequential experiments. Ann. Math. Stat.33 404–419. · Zbl 0151.22803 · doi:10.1214/aoms/1177704567
[10] Duchi, J. C., Khosravi, K. and Ruan, F. (2018). Supplement to “Multiclass classification, information, divergence and surrogate risk.” DOI:10.1214/17-AOS1657SUPP. · Zbl 1408.62115
[11] García-García, D. and Williamson, R. C. (2012). Divergences and risks for multiclass experiments. In Proceedings of the Twenty-Fifth Annual Conference on Computational Learning Theory.
[12] Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. J. Amer. Statist. Assoc.102 359–378. · Zbl 1284.62093 · doi:10.1198/016214506000001437
[13] Grünwald, P. D. and Dawid, A. P. (2004). Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Ann. Statist.32 1367–1433. · Zbl 1048.62008 · doi:10.1214/009053604000000553
[14] Györfi, L. and Nemetz, T. (1978). \(f\)-dissimilarity: A generalization of the affinity of several distributions. Ann. Inst. Statist. Math.30 105–113. · Zbl 0453.62014 · doi:10.1007/BF02480206
[15] Hiriart-Urruty, J.-B. and Lemaréchal, C. (1993). Convex Analysis and Minimization Algorithms. II. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences] 306. Springer, Berlin. · Zbl 0795.49001
[16] Kailath, T. (1967). The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol.15 52–60.
[17] Liese, F. and Vajda, I. (2006). On divergences and informations in statistics and information theory. IEEE Trans. Inform. Theory52 4394–4412. · Zbl 1287.94025 · doi:10.1109/TIT.2006.881731
[18] Longo, M., Lookabaugh, T. D. and Gray, R. M. (1990). Quantization for decentralized hypothesis testing under communication constraints. IEEE Trans. Inform. Theory36 241–255.
[19] Lugosi, G. and Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Ann. Statist.32 30–55. · Zbl 1105.62319
[20] Nguyen, X., Wainwright, M. J. and Jordan, M. I. (2009). On surrogate loss functions and \(f\)-divergences. Ann. Statist.37 876–904. · Zbl 1162.62060 · doi:10.1214/08-AOS595
[21] Österreicher, F. and Vajda, I. (1993). Statistical information and discrimination. IEEE Trans. Inform. Theory39 1036–1039. · Zbl 0792.62005 · doi:10.1109/18.256536
[22] Poor, H. V. and Thomas, J. B. (1977). Applications of Ali–Silvey distance measures in the design of generalized quantizers for binary decision systems. IEEE Trans. Commun.25 893–900. · Zbl 0373.94002 · doi:10.1109/TCOM.1977.1093935
[23] Pukelsheim, F. (2006). Optimal Design of Experiments. Classics in Applied Mathematics50. SIAM, Philadelphia, PA. · Zbl 1101.62063
[24] Reid, M. and Williamson, R. (2011). Information, divergence, and risk for binary experiments. J. Mach. Learn. Res.12 731–817. · Zbl 1280.68192
[25] Robbins, H. (1952). Some aspects of the sequential design of experiments. Bull. Amer. Math. Soc.58 527–535. · Zbl 0049.37009 · doi:10.1090/S0002-9904-1952-09620-8
[26] Schapire, R. E. and Freund, Y. (2012). Boosting: Foundations and Algorithms. MIT Press, Cambridge, MA. · Zbl 1278.68021
[27] Steinwart, I. (2007). How to compare different loss functions and their risks. Constr. Approx.26 225–287. · Zbl 1127.68089 · doi:10.1007/s00365-006-0662-3
[28] Tewari, A. and Bartlett, P. L. (2007). On the consistency of multiclass classification methods. J. Mach. Learn. Res.8 1007–1025. · Zbl 1222.62079
[29] Tishby, N., Pereira, F. and Bialek, W. (1999). The information bottleneck method. In The 37th Annual Allerton Conference on Communication, Control, and Computing.
[30] Tsitsiklis, J. N. (1993). Decentralized detection. In Advances in Signal Processing2 297–344. JAI Press, London.
[31] Vajda, I. (1972). On the \(f\)-divergence and singularity of probability measures. Period. Math. Hungar.2 223–234. · Zbl 0248.62001 · doi:10.1007/BF02018663
[32] Williamson, R. C., Vernet, E. and Reid, M. D. (2016). Composite multiclass losses. J. Mach. Learn. Res.17. · Zbl 1437.62241
[33] Zhang, T. (2003/04). Statistical analysis of some multi-category large margin classification methods. J. Mach. Learn. Res.5 1225–1251. · Zbl 1222.68344
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.