Prediction and entropy of printed English.

*(English)* Zbl 1165.94313
Introduction: In a previous paper [C. E. Shannon, Bell Syst. Tech. J. 27, 379–423, 623–656 (1948; Zbl 1154.94303)] the entropy and redundancy of a language have been defined. The entropy is a statistical parameter which measures,
in a certain sense, how much information is produced on the average for
each letter of a text in the language. If the language is translated into binary
digits (0 or 1) in the most efficient way, the entropy is the average number
of binary digits required per letter of the original language. The redundancy,
on the other hand, measures the amount of constraint imposed on a text in
the language due to its statistical structure, e.g., in English the high frequency of the letter E, the strong tendency of H to follow T or of U to follow
Q. It was estimated that, when statistical effects extending over not more
than eight letters are considered, the entropy is roughly 2.3 bits per letter
and the redundancy about 50 per cent.
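
The relation between the two quantities can be checked with a few lines of Python. This is only a minimal sketch, not the procedure of the paper: it assumes a 27-symbol alphabet (26 letters plus the space), takes redundancy as one minus the ratio of the entropy to the maximum log2(27) bits per letter, and the function names `normalize`, `unigram_entropy` and `redundancy` are illustrative. The single-letter estimate ignores all dependence between letters, so it overestimates the true entropy.

```python
import math
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"
ALPHABET_SIZE = len(LETTERS) + 1  # assumption: 26 letters plus the space


def normalize(text):
    """Lowercase the text, turn every non-letter into a space, collapse spaces."""
    chars = [c if c in LETTERS else " " for c in text.lower()]
    return " ".join("".join(chars).split())


def unigram_entropy(text):
    """Entropy in bits per letter from single-letter frequencies only."""
    counts = Counter(normalize(text))
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redundancy(entropy_bits, alphabet_size=ALPHABET_SIZE):
    """Redundancy as 1 - H / H_max, with H_max = log2(alphabet_size)."""
    return 1.0 - entropy_bits / math.log2(alphabet_size)
```

Under these assumptions `redundancy(2.3)` comes out slightly above one half, in line with the "about 50 per cent" quoted above.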
Since then a new method has been found for estimating these quantities,
which is more sensitive and takes account of long range statistics, influences
extending over phrases, sentences, etc. This method is based on a study of
the predictability of English: how well can the next letter of a text be predicted when the preceding \(N\) letters are known? The results of some experiments in prediction will be given, and a theoretical analysis of some of the
properties of ideal prediction. By combining the experimental and theoretical results it is possible to estimate upper and lower bounds for the entropy
and redundancy. From this analysis it appears that, in ordinary literary
English, the long range statistical effects (up to 100 letters) reduce the
entropy to something of the order of one bit per letter, with a corresponding
redundancy of roughly 75%. The redundancy may be still higher when
structure extending over paragraphs, chapters, etc. is included. However, as
the lengths involved are increased, the parameters in question become more
erratic and uncertain, and they depend more critically on the type of text
involved.
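
The step from prediction experiments to entropy bounds can be sketched as follows. The sketch assumes the bounds take the form usually attributed to this paper: if q_i is the relative frequency with which the correct letter is found on the i-th guess, the entropy lies between the sum of i (q_i - q_{i+1}) log2 i and the entropy of the distribution q itself. The function name `entropy_bounds` and the counts in the usage note are hypothetical, not data from the paper.

```python
import math


def entropy_bounds(rank_counts):
    """Return (lower, upper) entropy bounds in bits per letter.

    rank_counts[i] is how often the correct letter was the (i+1)-th guess.
    The lower bound assumes the frequencies are non-increasing in the rank,
    as they would be for an ideal predictor.
    """
    total = sum(rank_counts)
    q = [c / total for c in rank_counts]
    # Upper bound: entropy of the guess-rank distribution.
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    # Lower bound: sum over ranks i of i * (q_i - q_{i+1}) * log2(i).
    q_next = q[1:] + [0.0]
    lower = sum(
        i * (q[i - 1] - q_next[i - 1]) * math.log2(i)
        for i in range(1, len(q) + 1)
    )
    return lower, upper
```

For hypothetical counts such as `entropy_bounds([2, 1, 1])` the lower value stays below the upper one, and the two coincide when every rank is equally likely.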