zbMATH — the first resource for mathematics

Bioinformatics. Volume I. Data, sequence analysis, and evolution. 2nd edition. (English) Zbl 1378.92002
Methods in Molecular Biology 1525. New York, NY: Humana Press/Springer (ISBN 978-1-4939-6620-2/hbk; 978-1-4939-6622-6/ebook). x, 491 p. (2017).
The first volume of Bioinformatics, describing data, sequence analysis and evolution, is structured in three parts, namely, data and databases (Part 1), sequence analysis (Part 2) and phylogenetics and evolution (Part 3).
Part 1 consists of seven chapters that describe several publicly available databases of curated information and how to retrieve information from them; it also contains chapters guiding the reader on creating or curating annotations from expression data. In the first chapter, the authors review the three generations of sequencing: the first generation, by synthesis (Sanger sequencing) and by cleavage (Maxam-Gilbert sequencing); the second generation (including 454 pyrosequencing, Illumina sequencing and SOLiD); and the third generation (including Ion Torrent, Pacific Biosciences and nanopore). The authors discuss the general steps underlying a whole-genome sequencing task as well as the choices and options for selecting a sequencing strategy. The chapter concludes with an overview of applications of sequencing, ranging from comparative genomics to drug development and the analysis of microbial populations. In the second chapter, a particular method, PCAPSolexa, used for the assembly of short reads into longer contigs, is presented in detail. It is based on a hashing approach for determining overlaps between reads, with mismatches allowed, and a graph-based approach to determine the unique paths that form the contigs. In the third chapter, the authors review the steps required for determining the structure of a protein using crystallography and provide essential details on X-ray diffraction, on the use of the X-ray detector, and on data measurement and processing; a hands-on example is also included. In the fourth chapter, the authors present the structure of the INSDC (International Nucleotide Sequence Database Collaboration, comprising DDBJ, ENA and GenBank) and illustrate its different uses. The retrieval of genomic, transcriptomic or expression data, as well as the submission and maintenance of consistent entries, are presented using examples.
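The hashing idea behind the overlap detection summarised for the second chapter can be illustrated with a minimal sketch. It is not the PCAPSolexa implementation: the function names, the seed length `k`, and the thresholds `min_overlap` and `max_mismatches` are assumptions chosen for illustration, and the graph-based contig construction is not covered.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_overlaps(reads, k=4, min_overlap=6, max_mismatches=1):
    """Return tuples (i, j, length, mismatches) where a suffix of reads[i]
    overlaps a prefix of reads[j].

    Illustrative assumption: the first k bases of the prefix must match
    exactly (the hashed seed); mismatches are tolerated only after the seed.
    """
    # Hash table: leading k-mer of each read -> indices of reads starting with it.
    prefix_index = defaultdict(list)
    for j, read in enumerate(reads):
        prefix_index[read[:k]].append(j)

    overlaps = []
    for i, read in enumerate(reads):
        # Every start position leaving at least min_overlap bases is a candidate.
        for start in range(1, len(read) - min_overlap + 1):
            seed = read[start:start + k]
            for j in prefix_index.get(seed, []):
                if i == j:
                    continue
                length = len(read) - start
                if length > len(reads[j]):
                    continue
                mismatches = hamming(read[start:], reads[j][:length])
                if mismatches <= max_mismatches:
                    overlaps.append((i, j, length, mismatches))
    return overlaps


if __name__ == "__main__":
    toy_reads = ["ACGTACGGTA", "ACGGTATTCA", "TATTCAGGCC"]
    for overlap in find_overlaps(toy_reads):
        print(overlap)
```

In an assembler, the resulting pairs would serve as edges of an overlap graph, in which unambiguous paths are then merged into contigs.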
In the fifth chapter, the authors present approaches to retrieve and annotate a genome using publicly available information. To compensate for the scarcity of public resources for less frequently studied organisms, they show how evolutionary conservation and variation, or expression data, can be used to annotate a genome. In the sixth chapter, the authors describe the theoretical background of ontologies and demonstrate their use with examples based on the Gene Ontology and the Biological Pathway Exchange (BioPAX) ontology. Approaches for calculating enrichments and for visualizing the resulting pathways are also provided. In the seventh chapter, the authors review methods frequently used for the classification of proteins (using domains as discriminative features, given the underlying evolutionary conservation and divergence), such as automated domain sequence clustering, whole-chain clustering and multiple sequence alignments based on patterns and profiles. Domain databases such as PROSITE, Pfam and SMART are also described.
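As a rough illustration of what calculating an enrichment can involve, the sketch below uses a one-sided hypergeometric test, a common choice for Gene Ontology term enrichment. Whether the chapter uses exactly this test is an assumption, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, sample, hits):
    """Probability of observing at least `hits` annotated genes in a sample.

    population: number of genes in the background set
    annotated:  background genes carrying the term of interest
    sample:     size of the gene list under study
    hits:       genes in the list carrying the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(annotated, x) * comb(population - annotated, sample - x)
        for x in range(hits, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Toy numbers (made up): 40 of 1000 background genes carry the term,
    # and 8 of a 50-gene list do.
    print(hypergeom_enrichment_pvalue(1000, 40, 50, 8))
```

In practice the p-values of many terms are computed at once, so a multiple-testing correction would normally follow.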
The second part of the book, on sequence analysis, commences with the eighth chapter, where the authors present the theoretical aspects of multiple sequence alignments (including methods based on dynamic programming and the progressive alignment protocol) and their role in inferring evolutionary links between organisms. The effects of the input data (for example, the pre-selection of input sequences, unequal transcript lengths or the existence of subfamilies) are also examined. Examples using common tools such as PRALINE, MUSCLE, T-Coffee, MAFFT, ProbCons, Kalign, MSAProbs and Clustal Omega are also discussed. The ninth chapter continues the discussion of the previous chapter; the authors extend the description of the characteristics of similarity matrices such as PAM and BLOSUM and present further examples using tools including LAGAN, MUMmer, BLASTZ and AVID. The authors of the tenth chapter describe further sources of genomic information, including the Ensembl database, Vega, NCBI's Map Viewer and the UCSC Genome Browser, and review sequence-based, motif-based and matrix-based searches. Databases indexed on RNA identifiers, such as microRNA and piRNA datasets, are also presented. In the eleventh chapter, the authors review frequently used computational tools for annotating genes, including probabilistic ones (e.g., based on hidden Markov models (HMMs)) and ones based on supervised or unsupervised approaches, such as support vector machines or self-organising maps. The post-processing component of gene prediction and annotation pipelines is also discussed. The last chapter of this part focuses on the identification and characterization of the segmentation structure of transcripts using changeptGUI, a tool based on a Bayesian segmentation and classification model fitted by Markov chain Monte Carlo (MCMC) simulation. To illustrate the steps and the interpretation of the results, the author presents a fully detailed worked example.
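The dynamic programming principle underlying the alignment methods of the eighth chapter can be illustrated on the simplest case, a global pairwise alignment in the style of Needleman and Wunsch. This is a minimal sketch and not the algorithm of any of the tools listed above: the scores for match, mismatch and gap are arbitrary assumptions, whereas real tools use substitution matrices such as PAM or BLOSUM and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    best, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(best)
    print(top)
    print(bottom)
```

Progressive multiple alignment builds on this step by aligning sequences, and then groups of already aligned sequences, in an order given by a guide tree.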
The third part, on phylogenetics, commences with the thirteenth chapter, on natural selection. In this chapter, the author describes typical signatures of natural selection as observed in coding sequences and their inclusion in the CODEML model (part of the PAML package), illustrated with examples (in R). In the fourteenth chapter, the authors provide a gentle introduction to inferring phylogenetic trees, starting with assumptions on the sequence data, adjusting the scoring of the trees and inferring time-dependent relations using maximum likelihood methods; approaches for refining the resulting trees are also described. In the fifteenth chapter, building on the previous one, the authors discuss the importance of understanding the inherent shortcomings and assumptions of existing algorithms when trying to identify an optimal model of evolution. Among these assumptions, stationarity, reversibility and homogeneity are discussed in detail. In the sixteenth chapter, the authors discuss methods for the detection of lateral gene transfer (LGT) events based on the comparison of phylogenetic trees. The main steps of this task and frequently used tools such as BLAST, MUSCLE, MAFFT, CLANN and others are compared side by side. In the seventeenth chapter, the authors discuss yet another source of evolutionary change, genetic recombination, and its signature on phylogenetic trees, using the RDP4 tool. The theoretical aspects, the input and the interpretation of the results are reviewed using examples. This part concludes with Bayesian approaches to tree reconciliation using the guenomu tool. The necessity of this approach, together with the effect of the various parameters, is assessed on examples.
The book is organised in a methodical manner, making it a good and reliable starting point for the study of bioinformatics (sequence analysis, databases and inference of phylogenetic trees). The numerous examples and the detailed explanations of frequently used tools provide the basis for subsequent, more in-depth studies. This collection of chapters is suitable for a wide range of audiences, from undergraduates to established researchers, and for various backgrounds, from mathematics and computer science to biology and medicine.

92-02 Research exposition (monographs, survey articles) pertaining to biology
92-08 Computational methods for problems pertaining to biology
92C40 Biochemistry, molecular biology
92D15 Problems related to evolution
92D10 Genetics and epigenetics
00B15 Collections of articles of miscellaneous specific interest