Comparative gene finding. Models, algorithms and implementation. 2nd ed. (English) Zbl 1350.92001

Computational Biology 20. London: Springer (ISBN 978-1-4471-6692-4/hbk; 978-1-4471-6693-1/ebook). xx, 382 p. (2015).
The book under review is an updated and revised version of the first edition and contains in particular a new chapter describing tools and approaches based on next generation sequencing (NGS) used for gene prediction. The book is structured into eight chapters and commences with an introduction of the required notions such as the central dogma, the structure of a gene, and common challenges in defining the location of a gene. The author also includes a historical overview of algorithms developed for gene finding. The chapter concludes with a brief description of the four steps towards building a gene finder. The second chapter discusses single species gene finding using hidden Markov models. First, Markov chains and elements of dynamic programming are introduced. Next, the forward, backward and Viterbi algorithms are discussed. As an example, the author presents EasyGene, a gene finder for prokaryotes. In the next sections, generalized hidden Markov models (GHMMs) and interpolated Markov models (IMMs) are presented. As examples, Genscan, based on GHMMs, and GLIMMER, based on IMMs, are discussed. The chapter also describes approaches based on neural networks, with GRAIL as example, decision trees, with MORGAN as example, and on conditional random fields, with Conrad as example. In the third chapter, the author reviews pairwise and multiple sequence alignments. For the former, the nucleotide and amino-acids substitution models are discussed, followed by a presentation of the Needleman-Wunsch and the Smith-Waterman algorithms. To ensure a thorough description of the multiple sequence alignments, the author commences with an overview of scoring schemes and methods for progressive alignments and other iterative approaches. Next, various methods, accompanied by examples, are discussed. As application to HMMs, the author presents SAM – sequence alignment and modelling; for genetic algorithms, the example is SAGA – sequence alignments by genetic algorithms, and for simulated annealing, the example is MSASA, which produces multiple alignments by iterative rearrangements. In the fourth chapter, the author describes comparative gene finding based either on
similarity searches (GenomeScan, a GHMM-based gene finding approach using homology),
heuristic cross species method (ROSETTA),
pair hidden Markov models (DoubleScan),
generalized pair hidden Markov models (SLAM),
gene mapping (GeneMapper) and
multiple sequence gene finding (N-scan).
The fifth chapter is focused on gene structure submodels. First, the state space is described, comprising of exons, splice sites, introns and intergenic regions, untranslated regions, and promoters. Next, state length distributions are discussed, followed by content sensors such as GC content, codon and amino-acid usage and \(k\)-tuple frequency analysis. The chapter concludes with approaches for the detection of splice sites, including weight matrices and weight array models, variable length Markov models, maximal dependence decomposition, and others. In the sixth chapter, the author describes the parameter search for gene finding algorithms; MLE, EM and Baum-Welch are discussed in detail. The back-propagation algorithm, discriminative training, Gibbs sampling and simulated annealing are also presented. The seventh chapter focuses on the implementation of a comparative gene finder. First, the program structure is described, followed by an example on the GPHMM model for modelling intron and intergenic pairs and exon pairs. In the eight chapter, the author presents annotation pipelines for next generation sequencing data. The chapter commences with a brief overview of DNA sequencing and current NGS technologies. Next, the author presents NGS annotation pipelines which investigate the assembly quality and can be used for repeat masking and gene annotation. As example, MAKER is presented, which combines various sources of evidence for genome annotation and whose output is compatible with commonly used GMOD programs such as GBROWSE and JBROWSE.
The structure of the book mirrors the learning steps for understanding how to perform gene finding. Although written in an accessible language, it requires some prior knowledge of the various fields it covers, such as probability theory, statistics, information theory, optimization theory and numerical analysis. Its target audience is mainly post-graduate researchers or established researchers with a background in mathematics or statistics applied in bioinformatics who need a thorough yet concise overview of this field.


92-02 Research exposition (monographs, survey articles) pertaining to biology
92-08 Computational methods for problems pertaining to biology
92C40 Biochemistry, molecular biology
92D10 Genetics and epigenetics
60J22 Computational methods in Markov chains
92B20 Neural networks for/in biological studies, artificial life and related topics
90C39 Dynamic programming


Zbl 1296.92004
Full Text: DOI