Algorithms on strings, trees, and sequences. Computer science and computational biology. (English) Zbl 0934.68103

Cambridge: Cambridge University Press. xviii, 534 p. (1997).
Most traditional texts on algorithms in computer science contain one, perhaps two, principal chapters on techniques related to strings and sequences. This book is devoted entirely to these topics and for very good reason. The area of computational biology is hot, and the abstraction of DNA from a flexible three-dimensional molecule to a one-dimensional string opens the way for the use of powerful computing techniques to assist in the analysis of the structure and properties of these fundamental biological building blocks. So, unlike many computer science texts, the book is focused on applications to biology. Most algorithms presented are motivated by related biological problems. But at the same time the text is not just a “how to” manual on string and sequence analysis. A wide range of algorithms are rigorously presented along with correctness proofs and timing analyses. The author usually introduces a problem with a naive method for solving it, followed by repeatedly enhanced and sophisticated methods. Among the wide spectrum of techniques presented, most are illustrated with problems in current biological research. But many are chosen because of their potential application in future molecular biology. This stamps the book as ideal for computer scientists wishing to engage in the rapidly changing field of bioinformatics research. The author also points out that there will be many biologists and mathematicians with sufficient algorithmic background to read the book. The book includes all of the classic topics and most of the important advanced techniques in the field of string algorithms, with the exception of probabilistic methods (only lightly touched upon), parallel algorithms, theoretical results on algorithms for infinite alphabets and on algorithms that use only constant space, and machine learning stochastic-oriented methods.
The text begins in Part I with the fundamental problem of exact string matching, starting with the Boyer-Moore and Knuth-Morris-Pratt algorithms, and then proceeding to variations including the Karp-Rabin fingerprint methods. Part II is devoted to suffix trees, a data structure that allows efficient solutions to a wide range of complex string problems. The suffix tree construction algorithms of Ukkonen, Weiner and McCreight are presented, followed by more than twenty applications, including a major section on lowest common ancestor retrieval. Part III covers the topics of inexact matching, sequence alignment and dynamic programming. The section begins with a detailed examination of the classic edit distance problem and continues with discussions of string similarity and alignment. Part III concludes with a chapter on the Holy Grail of multiple string comparison, one of the most important and active research areas in current biological sequence analysis, and a chapter on the building and searching of databases holding molecular sequence data. In Part IV the author branches out from well established techniques and from problems strictly defined on strings. He first discusses “currents”, being techniques that may lead to more powerful and effective methods, or techniques that may become less important as technology changes. He then discusses “cousins”, being problems, such as physical mapping, fragment assembly, and building phylogenetic trees, that, although related to string problems, are not themselves string problems. Finally he introduces a few important “cameo” topics, as well as topics that cross all three categories. The book concludes with a comprehensive bibliography containing almost 500 entries and, of particular use to Computer Scientists, a glossary composed predominantly of terms in molecular biology.
This work is a timely contribution to the burgeoning research effort in computational biology. As the author himself points out,“Computer Science will make the most serious contribution to biology after the emergence of a large community of people who understand both fields and who know the strengths and needs of both”. By drawing together such a wide range of applicable topics and presenting them in such a clear, motivational, and informative manner the author has contributed significantly to this effort and has provided an invaluable resource for graduate students and professional academics working, or intending to work, in the area of computational biology. The book could well be used as the basis for a graduate-level course, particularly as it contains over 400 exercises to reinforce presented material and to develop further topics. It is recommended most highly.


68W05 Nonnumerical algorithms
68-01 Introductory exposition (textbooks, tutorial papers, etc.) pertaining to computer science
68Q25 Analysis of algorithms and problem complexity
68R15 Combinatorics on words
68U15 Computing methodologies for text processing; mathematical typography
92C40 Biochemistry, molecular biology
92D20 Protein sequences, DNA sequences
Full Text: DOI