Information theory and statistics. (English) Zbl 0088.10406

Wiley Publication in Mathematical Statistics. New York: John Wiley & Sons, Inc.; London: Chapman & Sons, Inc. xvii, 395 p. (1959).
The monograph compiles results obtained by the author and by others, and it also contains some new results.
Chapters 1–5 give the general theory. Let \(\mu_1\) and \(\mu_2\) be two probability distributions which are absolutely continuous with respect to each other, and set \(d\mu_1 = g(x)\,d\mu_2\). The quantity \(I_{12} = \int \log g(x) \,d\mu_1\) and some related quantities are introduced as measures of information. \(I_{12}\) measures the ability of the statistical experiment to distinguish \(\mu_1\) from \(\mu_2\) when \(\mu_1\) is the distribution underlying the experiment. The book is mainly concerned with the relevance of this information concept to statistical inference and touches only occasionally on the concept of information in communication theory. However, the close connection between statistical inference and communication theory is described, “nature” being the source, the statistical experiment being the channel and the observation being the partly distorted signal received. An interesting inequality, due to Sakaguchi, connecting \(I_{12}\) with the Wiener-Shannon concept is derived. On the other hand, the attempt on pages 7–8 to connect the two concepts is difficult to follow. It seems to be based on an erroneous use of Bayes' theorem. Equation (2.2) assumes \(H_1\) and \(H_2\) to be exclusive, which is not true in Example 4.1.
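For distributions on a finite set the integral defining \(I_{12}\) reduces to a sum, which the following minimal sketch computes. The function name and the two coin distributions are illustrative assumptions, not taken from the book.

```python
import math

def discrimination_information(p1, p2):
    """I_12 = sum_x p1(x) * log(p1(x) / p2(x)) on a common finite set.

    Mutual absolute continuity is assumed, so p2(x) > 0 wherever p1(x) > 0.
    Outcomes with p1(x) == 0 contribute nothing to the sum.
    """
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)

# Hypothetical example: a biased coin against a fair coin.
fair = [0.5, 0.5]
biased = [0.9, 0.1]

i_12 = discrimination_information(biased, fair)  # biased coin underlies the experiment
i_21 = discrimination_information(fair, biased)  # fair coin underlies the experiment
```

The two values differ, which reflects that \(I_{12}\) depends on which distribution is taken to underlie the experiment.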
In chapter 2 the fundamental properties of \(I_{12}\) are developed. It is nonnegative and additive, and grouping the observations cannot increase the information. In general, replacing \(x\) by a statistic \(T(x)\) will not increase the amount of information, but if \(T(x)\) is sufficient for a family of distributions, neither will it decrease the amount of information connected with any two distributions of the family. This last property nicely ties the information measure to the intuitive notion of information which is used in justifying the principle of sufficiency. (The use of “conditional density” on page 20 is not clear without further explanation, but this has no consequences.)
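These properties can be checked numerically in the finite case. The sketch below is an illustration under assumed example distributions (the helper names `kl`, `group` and `bernoulli_pairs` are hypothetical): it groups four outcomes into two cells, forms the joint distribution of two independent observations, and compares a pair of Bernoulli observations with their sum, which is sufficient for the Bernoulli family.

```python
import math
from itertools import product

def kl(p1, p2):
    """Discrimination information sum_x p1(x) * log(p1(x) / p2(x))."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)

def group(p, cells):
    """Distribution of the statistic that maps each outcome to its cell."""
    return [sum(p[i] for i in cell) for cell in cells]

p1 = [0.1, 0.2, 0.3, 0.4]
p2 = [0.25, 0.25, 0.25, 0.25]
cells = [(0, 1), (2, 3)]

full = kl(p1, p2)
grouped = kl(group(p1, cells), group(p2, cells))  # never exceeds `full`

# Additivity: the joint law of two independent observations.
two_obs = kl([a * b for a, b in product(p1, p1)],
             [a * b for a, b in product(p2, p2)])

def bernoulli_pairs(theta):
    """Law of (x1, x2) for two independent Bernoulli(theta) trials: 00, 01, 10, 11."""
    q = 1.0 - theta
    return [q * q, q * theta, theta * q, theta * theta]

# T = x1 + x2 is sufficient, so grouping by T loses no information.
by_sum = [(0,), (1, 2), (3,)]
a, b = bernoulli_pairs(0.3), bernoulli_pairs(0.6)
pairs_info = kl(a, b)
sum_info = kl(group(a, by_sum), group(b, by_sum))
```

Here `grouped` is strictly smaller than `full` because the likelihood ratio varies inside each cell, whereas `pairs_info` and `sum_info` agree because the ratio is constant on each level set of the sufficient statistic.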
A very interesting mathematical property of the information concept, which is due to the author, is proved in chapter 3. Let \(T(x)\) be any statistic and \(\Theta\) a constant (possibly multidimensional). Among all \(\mu_1\) satisfying \(\int T(x) \,d\mu_1 = \Theta\), the one which for given \(\mu_2\) minimizes \(I_{12}\) is of the form \(d\mu_1 = \text{const. }e^{\tau T(x)}\,d\mu_2\), where \(\tau\) is a constant determined by \(\Theta\). Thus Koopman's exponential class of distributions is connected with minimum discrimination information. This class is studied in mathematical detail. A relation between \(I_{12}\) and the Fisher information matrix has been derived in chapter 2, and further light is shed on this relation in chapter 3 by means of sufficiency and the exponential class. The author is led to the Fréchet-Cramér-Rao inequality.
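A small numerical illustration of the minimizing property, under assumptions chosen only for the example: \(\mu_2\) is a fair die, \(T(x) = x\), and the constraint is a mean of 4.5. The constant \(\tau\) is found by bisection, which is one convenient choice and not a method attributed to the book; all function names are hypothetical.

```python
import math

def kl(p1, p2):
    return sum(a * math.log(a / b) for a, b in zip(p1, p2) if a > 0)

def tilt(p2, t_values, tau):
    """Return p1(x) proportional to exp(tau * T(x)) * p2(x)."""
    weights = [b * math.exp(tau * t) for b, t in zip(p2, t_values)]
    total = sum(weights)
    return [w / total for w in weights]

def mean(p, t_values):
    return sum(a * t for a, t in zip(p, t_values))

def solve_tau(p2, t_values, theta, lo=-20.0, hi=20.0, iterations=200):
    """Bisection on tau; the mean of the tilted law is increasing in tau."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if mean(tilt(p2, t_values, mid), t_values) < theta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

faces = [1, 2, 3, 4, 5, 6]
fair_die = [1.0 / 6.0] * 6
theta = 4.5

tau = solve_tau(fair_die, faces, theta)
p_star = tilt(fair_die, faces, tau)

# A competitor with the same mean: perturb p_star along a direction that
# keeps both the total mass and the mean unchanged.
direction = [1, -2, 1, 0, 0, 0]
competitor = [p + 0.01 * d for p, d in zip(p_star, direction)]
```

Both `p_star` and `competitor` satisfy the mean constraint, and `kl(p_star, fair_die)` is the smaller of the two informations, as the theorem asserts.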
Let \(f(x; \Theta)\) be a probability density with respect to a known measure and \(\Theta\) an unknown parameter. Let \(I(\Theta_1, \Theta_2; y)\) be the information contained in the statistic \(y = T(x)\) for distinguishing \(f(x;\Theta_1)\) from \(f(x;\Theta_2)\). The discrimination efficiency of \(T(x)\) at \(\Theta\) is defined as \[ \lim_{\Delta\Theta\to 0}I(\Theta + \Delta\Theta, \Theta; y)/ I(\Theta + \Delta\Theta, \Theta; x). \]
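As an illustration of this ratio, the following sketch takes a hypothetical example not drawn from the book: a single observation \(x\) from a normal distribution with mean \(\Theta\) and unit variance, and the statistic \(y\) recording only whether \(x > 0\). The ratio is evaluated at a small but finite \(\Delta\Theta\) rather than in the limit.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def info_full(theta_1, theta_2):
    """I(theta_1, theta_2; x) for x ~ N(theta, 1), which is (theta_1 - theta_2)^2 / 2."""
    return 0.5 * (theta_1 - theta_2) ** 2

def info_sign(theta_1, theta_2):
    """I(theta_1, theta_2; y) for y = 1 if x > 0 else 0, a Bernoulli statistic."""
    p1, p2 = normal_cdf(theta_1), normal_cdf(theta_2)
    return p1 * math.log(p1 / p2) + (1.0 - p1) * math.log((1.0 - p1) / (1.0 - p2))

def efficiency_ratio(theta, delta):
    """Ratio of the two informations at a finite step delta."""
    return info_sign(theta + delta, theta) / info_full(theta + delta, theta)

ratio_at_zero = efficiency_ratio(0.0, 0.01)
```

At \(\Theta = 0\) the ratio is close to \(2/\pi\), the ratio of the Fisher informations of \(y\) and \(x\) there, which is consistent with the relation between \(I_{12}\) and Fisher information mentioned above.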
Chapters 4 and 5 treat statistical decision problems in general by means of the information concept. The case of testing one hypothesis against another is studied. For a given probability of an error of the first kind, lower bounds for the probability of an error of the second kind are expressible by means of the information concept. The same holds for the asymptotic behaviour of the probability of an error of the second kind as the number of independent observations approaches infinity. The classification problem in the case of the exponential family is studied, and it is shown that the familiar procedure accords with results obtained by using minimum discrimination information.
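A bound of this kind can be sketched from the grouping property of chapter 2. The notation is an assumption made for illustration and need not match the book's: \(\mu_1\) is taken as the hypothesis under test, \(E\) is the rejection region, \(\alpha = \mu_1(E)\) and \(\beta = \mu_2(E^c)\), with \(0 < \alpha < 1\) and \(0 < \beta < 1\). Grouping the sample space into \(E\) and \(E^c\) cannot increase the information, so \[ I_{12} \ge \alpha \log\frac{\alpha}{1-\beta} + (1-\alpha)\log\frac{1-\alpha}{\beta}. \] Since \(\alpha\log\alpha + (1-\alpha)\log(1-\alpha) \ge -\log 2\) and \(-\alpha\log(1-\beta) \ge 0\), it follows that \[ \beta \ge \exp\left(-\frac{I_{12} + \log 2}{1-\alpha}\right). \] For \(n\) independent observations, additivity replaces \(I_{12}\) by \(n\) times the information of a single observation, so for fixed \(\alpha\) the probability of an error of the second kind cannot decrease faster than exponentially in \(n\).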
Estimating the minimum discrimination information in the case when \(\mu_2\) represents the probability distribution for independent observations leads to a useful statistic. The statistic measures the resemblance between sample and population when \(\mu_2\) is the true distribution, and it is used to construct statistical procedures, in particular for testing hypotheses.
Properties of such procedures are studied, and a new version of Wilks' theorem on the likelihood ratio test, due to Kupperman, is developed.
The last eight chapters, chapters 6–13, are devoted to applications of the general theory to multinomial, Poisson and multivariate normal populations. In fact, the standard subjects of statistical theory are treated with the information concept as the unifying idea. Several subjects are treated in greater depth than is usual, for example homogeneity tests and interaction in multinomial analysis.
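For the multinomial case, a statistic of this kind can be sketched as follows. The sketch assumes the estimate takes the form \(2\sum_i x_i \log\bigl(x_i/(n p_i)\bigr)\) for observed counts \(x_i\), sample size \(n\) and hypothesized cell probabilities \(p_i\); the counts are invented for the example, and the Pearson statistic is included only for comparison.

```python
import math

def mdi_statistic(counts, probs):
    """2 * sum_i x_i * log(x_i / (n * p_i)); empty cells contribute nothing."""
    n = sum(counts)
    return 2.0 * sum(x * math.log(x / (n * p))
                     for x, p in zip(counts, probs) if x > 0)

def pearson_statistic(counts, probs):
    """Classical chi-square statistic sum_i (x_i - n p_i)^2 / (n p_i)."""
    n = sum(counts)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(counts, probs))

counts = [18, 22, 25, 35]            # hypothetical sample, n = 100
probs = [0.25, 0.25, 0.25, 0.25]     # hypothesized cell probabilities

mdi = mdi_statistic(counts, probs)
pearson = pearson_statistic(counts, probs)
```

The statistic is zero when the observed proportions coincide with the hypothesized ones and grows as the sample departs from them; for counts like these it is of similar size to the Pearson statistic.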
One might be in doubt as to the fundamental importance of the information concept in statistical decision theory. However, there is no doubt that the concept has given the author occasion to develop a great variety of very useful results.
The list of references is impressive and is frequently quoted in the text. The monograph will undoubtedly be a very valuable reference book.
Reviewer: E. Sverdrup


62-01 Introductory exposition (textbooks, tutorial papers, etc.) pertaining to statistics
62B10 Statistical aspects of information-theoretic topics
94A15 Information theory (general)