## Mutual information, metric entropy and cumulative relative entropy risk. (English) Zbl 0920.62007

Given a sample of size $$n$$ distributed according to $$P_{\theta^*}$$, let $$\widehat P_t$$ be an estimate of $$P_{\theta^*}$$ based on the observations up to time $$t\leq n$$. The authors measure the total risk up to time $$n$$ by the Kullback-Leibler divergence $$R_{n,\widehat P}(\theta^*)=D_{KL} (P^n_{\theta^*} \|\widehat P)$$, where $$\widehat P$$ is the product measure of the $$\widehat P_t$$. Based on this risk the authors introduce the minimax as well as the Bayes strategy. It turns out that the Bayes risk is the mutual information $$I (\Theta^*;Y^n)$$ of the random parameter $$\Theta^*$$ and the sample $$Y^n$$.
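The identity between the Bayes risk and the mutual information can be checked numerically. The following sketch (not from the paper; the two-point Bernoulli family and uniform prior are illustrative assumptions) compares the prior-averaged divergence $$D_{KL}(P^n_\theta\|M_n)$$ of the product distributions from the Bayes mixture with $$I(\Theta^*;Y^n)$$ computed as $$H(Y^n)-H(Y^n\mid\Theta^*)$$:

```python
import itertools
import math

# Illustrative setup (not from the paper): two Bernoulli parameters,
# a uniform prior, and n i.i.d. binary observations.
thetas = [0.3, 0.7]
prior = [0.5, 0.5]
n = 3

def p_seq(theta, y):
    """Probability of the binary sequence y under Bernoulli(theta)^n."""
    return math.prod(theta if b else 1.0 - theta for b in y)

seqs = list(itertools.product([0, 1], repeat=n))
# Bayes mixture M_n: the marginal distribution of Y^n under the prior.
mix = {y: sum(w * p_seq(t, y) for w, t in zip(prior, thetas)) for y in seqs}

# Bayes cumulative risk: prior-weighted KL divergence D(P_theta^n || M_n).
bayes_risk = sum(
    w * sum(p_seq(t, y) * math.log(p_seq(t, y) / mix[y]) for y in seqs)
    for w, t in zip(prior, thetas)
)

# Mutual information I(Theta; Y^n) = H(Y^n) - H(Y^n | Theta), in nats.
h_marginal = -sum(m * math.log(m) for m in mix.values())
h_conditional = sum(
    w * -sum(p_seq(t, y) * math.log(p_seq(t, y)) for y in seqs)
    for w, t in zip(prior, thetas)
)
mutual_info = h_marginal - h_conditional

assert abs(bayes_risk - mutual_info) < 1e-12
```

The two quantities agree to machine precision, as the identity predicts for the Bayes strategy, whose predictive distributions multiply out to the mixture $$M_n$$.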
Let $$M_{n,\mu}$$ be the marginal distribution of $$Y^n$$ when $$\Theta^*$$ has distribution $$\mu$$. The main results in the first part of the paper are two-sided bounds on $$I(\Theta^*;Y^n)$$ and $$D_{KL}(P^n_{\theta^*}\| M_{n,\mu})$$ in terms of the Rényi distances $$I_\alpha (P_{\theta^*}, P_{\tilde\theta})$$ of order $$\alpha$$ and $$I_1 (P_{\theta^*},Q_{\tilde\theta})$$, where $$Q_\theta$$ is the distribution of $$Y$$ given $$\theta$$. For a finite parameter set, $$I(\Theta^*;Y^n)$$ is shown to converge at an exponential rate to the entropy $$H(\Theta^*)$$.
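The exponential convergence in the finite case is easy to observe numerically. In this sketch (the two-point Bernoulli family and uniform prior are illustrative assumptions, not the paper's setup), the gap $$H(\Theta^*)-I(\Theta^*;Y^n)$$, i.e. the residual conditional entropy $$H(\Theta^*\mid Y^n)$$, is tracked as $$n$$ grows:

```python
import itertools
import math

# Illustrative finite parameter set (not from the paper): two well-separated
# Bernoulli parameters under a uniform prior.
thetas = [0.1, 0.9]
prior = [0.5, 0.5]

def mutual_info(n):
    """I(Theta; Y^n) for n i.i.d. Bernoulli observations, in nats."""
    total = 0.0
    for y in itertools.product([0, 1], repeat=n):
        probs = [math.prod(t if b else 1.0 - t for b in y) for t in thetas]
        m = sum(w * p for w, p in zip(prior, probs))
        total += sum(w * p * math.log(p / m) for w, p in zip(prior, probs))
    return total

h_theta = math.log(2)  # entropy H(Theta*) of the uniform two-point prior

# Gap H(Theta*) - I(Theta*; Y^n) = H(Theta* | Y^n) for n = 1, ..., 10.
gaps = [h_theta - mutual_info(n) for n in range(1, 11)]
# Successive ratios stay below 1: the gap shrinks geometrically with n.
ratios = [gaps[i + 1] / gaps[i] for i in range(len(gaps) - 1)]
```

On this example the gap decays geometrically, consistent with the exponential-rate statement for finite parameter sets; better-separated parameters give a faster rate.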
The asymptotics of the minimax risk based on $$R_{n,\widehat P}(\theta^*)$$ are studied for general $$\Theta$$ in the second part of the paper. The authors use the concepts of metric entropy and metric dimension to describe the topological structure of $$\Theta$$ and to relate these properties to the minimax risk for large $$n$$.
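As a reminder of the notions involved (standard definitions, not results from the paper): the metric entropy of a totally bounded set is $$\log N(\varepsilon)$$, the logarithm of its smallest $$\varepsilon$$-covering number, and the metric dimension is the growth exponent of this entropy as $$\varepsilon\to 0$$. A minimal sketch for the unit interval, whose metric dimension is 1:

```python
import math

def covering_number(eps):
    """Smallest number of eps-balls (intervals of radius eps) covering [0, 1]."""
    return math.ceil(1.0 / (2.0 * eps))

# Metric entropy log N(eps) grows like log(1/eps), so the dimension
# estimate log N(eps) / log(1/eps) tends to 1 as eps shrinks.
dims = []
for eps in [0.1, 0.01, 0.001]:
    entropy = math.log(covering_number(eps))
    dims.append(entropy / math.log(1.0 / eps))
```

The estimates increase toward 1 as $$\varepsilon$$ decreases, reflecting that $$\log N(\varepsilon)\sim\log(1/\varepsilon)$$ for an interval; richer parameter sets, e.g. smoothness classes of densities, have much faster-growing metric entropy, which is what drives the minimax rates in the paper.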
Reviewer: F. Liese (Rostock)

### MSC:

- 62B10 Statistical aspects of information-theoretic topics
- 62C20 Minimax procedures in statistical decision theory
- 94A15 Information theory (general)
