Biological sequence analysis. Probabilistic models of proteins and nucleic acids.

*(English)*Zbl 0929.92010
Cambridge: Cambridge University Press. 360 p. (1998).

Computational Biology, also called Bioinformatics, is a rapidly growing, interdisciplinary field posed to be of immense importance in the first decades of the 21-st century where megabytes of raw biological data (e.g., of genome sequences, protein conformations, etc.) can only be analyzed and interpreted using computer programs. Though bioinformatics applications concern molecular biology (and in the future certainly other fields, such as neurobiology), the methods used issue from statistics, computer science, and mathematics. For this reason, different texts treat different aspects of computational biology. For instance, the books of M. Waterman [Introduction to computational biology: Maps, sequences and genomes. (1995; Zbl 0831.92011)], P. Baldi and S. Brunak [Bioinformatics - The machine learning approach. (1998)], J. Setubal and J. Meidanis [Introduction to computational molecular biology. (1997)] all have a different area of focus, where for instance Baldi and Brunak emphasize machine learning algorithms such as hidden Markov models, neural nets, and hybrid models.

The current text under review is a welcome addition to existing books, and, as its title indicates, focuses mainly on probabilistic methods including Markov models, hidden Markov models, parsiminony and maximum likelihood phylogeny construction algorithms, stochastic context-free grammars, profiles, etc. The main application areas concern sequence alignment (pairwise and multiple), genomic motif recognition, phylogeny, and RNA secondary structure. The text is clearly aimed not only at an audience of statisticians, computer scientists and mathematicians, but also at practicing biologists interested in computational biology. To that end, though entirely rigorous, the book does an admirable job of suppressing certain mathematical details, which might otherwise be distracting to biologists; for instance, details about probability distributions (such as Dirichlet distributions), entropy, and the EM algorithm (Expectation-Maximization) are relegated to the last chapter. By these means, the book gains in general readability. Broadly speaking, the book is Bayesian in outlook – given a certain prior probability distribution (such as compositional frequencies of amino acids), one computes a new posterior distribution using Bayesian learning (e.g. a stochastic model for a class of proteins, when given a training set of members of the class to be learned).

There are 11 chapters in the book. Chapter 1 is an introduction to probability models, maximum likelihood and maximum a posteriori estimation. Simple examples concerning dice illustrate concepts such as the multinomial distribution for a wide audience. Chapter 2 concerns pairwise sequence alignment using dynamic programming methods (Needleman - Wunsch, Smith - Waterman, etc.). Chapter 3 concerns Markov chains and hidden Markov models (HMMs). In a hidden Markov model, one cannot observe the state of the model, but only an output symbol, where the output distribution depends on the state. Thus a stochastic model can be constructed for a protein family (say G-coupled receptor proteins), where with a certain probability, a particular amino acid symbol will be output at the \(i\)-th position of the chain. Chapter 4 explains how to perform pairwise sequence alignments using HMM’s. Since pairwise sequence alignment is not typically computed using HMM’s, this chapter is largely pedagogic and provides the reader with better intuition for later applications of HMM’s in multiple sequence alignments.

Chapters 5 and 6 concern the construction of profiles for database searches and survey some methods for multiple sequence alignments, known to be an \(NP\)-complete problem. For instance, given a training set of globin proteins, using HMMs one can construct a stochastic model of a globin, yielding a profile, used then to search for hits in a database. Chapters 7 and 8 present algorithms for phylogeny construction (clustering methods, maximum likelihood, parsimony, etc.). Chapter 9 concerns formal grammars – transformational, regular, context-free, stochastic context-free, as these can be viewed as generalizations of HMMs. Chapter 10 covers RNA secondary structure prediction using dynamic programming and stochastic context-free grammars. The idea of applying grammars to describe RNA secondary structure goes back to Searls. The gist of the idea can be seen, in considering a hairpin loop ACGUXXXXXACGU, where base pairing is of the form ((((.....)))). In this fashion, secondary structure is essentially a balanced parenthesis expression, hence generated by a context-free grammar of the form \(S \rightarrow (S)S | \mathtt{.} | \epsilon\). Placing probabilities on the grammar’s production rules yields a stochastic context-free grammar, and the CYK or Early parser, along with the forward-backward method of HMMs, yields an efficient algorithm to learn the probabilities of rule applications given a training set. Finally, Chapter 11, as earlier mentioned, is essentially an appendix for the most important concepts from probability theory and statistics used for the algorithms of this text.

Molecular biology, like quantum physics in the early part of the 20-th century, is now at the exciting stage where important contributions are comming from mathematics and the related fields of statistics and computer science. Any scientifically educated reader with an interest in understanding the algorithmic techniques of computational molecular biology – used, for instance, after DNA amplification, to determine that Neanderthal was not a direct ancestor of H. sapiens in the work of Svante Pääbo – should read this well-written and enjoyable classic.

The current text under review is a welcome addition to existing books, and, as its title indicates, focuses mainly on probabilistic methods including Markov models, hidden Markov models, parsiminony and maximum likelihood phylogeny construction algorithms, stochastic context-free grammars, profiles, etc. The main application areas concern sequence alignment (pairwise and multiple), genomic motif recognition, phylogeny, and RNA secondary structure. The text is clearly aimed not only at an audience of statisticians, computer scientists and mathematicians, but also at practicing biologists interested in computational biology. To that end, though entirely rigorous, the book does an admirable job of suppressing certain mathematical details, which might otherwise be distracting to biologists; for instance, details about probability distributions (such as Dirichlet distributions), entropy, and the EM algorithm (Expectation-Maximization) are relegated to the last chapter. By these means, the book gains in general readability. Broadly speaking, the book is Bayesian in outlook – given a certain prior probability distribution (such as compositional frequencies of amino acids), one computes a new posterior distribution using Bayesian learning (e.g. a stochastic model for a class of proteins, when given a training set of members of the class to be learned).

There are 11 chapters in the book. Chapter 1 is an introduction to probability models, maximum likelihood and maximum a posteriori estimation. Simple examples concerning dice illustrate concepts such as the multinomial distribution for a wide audience. Chapter 2 concerns pairwise sequence alignment using dynamic programming methods (Needleman - Wunsch, Smith - Waterman, etc.). Chapter 3 concerns Markov chains and hidden Markov models (HMMs). In a hidden Markov model, one cannot observe the state of the model, but only an output symbol, where the output distribution depends on the state. Thus a stochastic model can be constructed for a protein family (say G-coupled receptor proteins), where with a certain probability, a particular amino acid symbol will be output at the \(i\)-th position of the chain. Chapter 4 explains how to perform pairwise sequence alignments using HMM’s. Since pairwise sequence alignment is not typically computed using HMM’s, this chapter is largely pedagogic and provides the reader with better intuition for later applications of HMM’s in multiple sequence alignments.

Chapters 5 and 6 concern the construction of profiles for database searches and survey some methods for multiple sequence alignments, known to be an \(NP\)-complete problem. For instance, given a training set of globin proteins, using HMMs one can construct a stochastic model of a globin, yielding a profile, used then to search for hits in a database. Chapters 7 and 8 present algorithms for phylogeny construction (clustering methods, maximum likelihood, parsimony, etc.). Chapter 9 concerns formal grammars – transformational, regular, context-free, stochastic context-free, as these can be viewed as generalizations of HMMs. Chapter 10 covers RNA secondary structure prediction using dynamic programming and stochastic context-free grammars. The idea of applying grammars to describe RNA secondary structure goes back to Searls. The gist of the idea can be seen, in considering a hairpin loop ACGUXXXXXACGU, where base pairing is of the form ((((.....)))). In this fashion, secondary structure is essentially a balanced parenthesis expression, hence generated by a context-free grammar of the form \(S \rightarrow (S)S | \mathtt{.} | \epsilon\). Placing probabilities on the grammar’s production rules yields a stochastic context-free grammar, and the CYK or Early parser, along with the forward-backward method of HMMs, yields an efficient algorithm to learn the probabilities of rule applications given a training set. Finally, Chapter 11, as earlier mentioned, is essentially an appendix for the most important concepts from probability theory and statistics used for the algorithms of this text.

Molecular biology, like quantum physics in the early part of the 20-th century, is now at the exciting stage where important contributions are comming from mathematics and the related fields of statistics and computer science. Any scientifically educated reader with an interest in understanding the algorithmic techniques of computational molecular biology – used, for instance, after DNA amplification, to determine that Neanderthal was not a direct ancestor of H. sapiens in the work of Svante Pääbo – should read this well-written and enjoyable classic.

Reviewer: Peter Clote (München)

##### MSC:

92C40 | Biochemistry, molecular biology |

92-01 | Introductory exposition (textbooks, tutorial papers, etc.) pertaining to biology |

92-02 | Research exposition (monographs, survey articles) pertaining to biology |

92-08 | Computational methods for problems pertaining to biology |