Computational exome and genome analysis. (English) Zbl 1384.92004

Chapman & Hall/CRC Mathematical and Computational Biology Series. Boca Raton, FL: CRC Press (ISBN 978-1-4987-7598-4/hbk; 978-1-4987-7599-1/ebook). xxi, 552 p. (2018).
This book is an excellent example of a hybrid between a textbook and an up-to-date research reference on the latest bioinformatics tools available in this field. Its rigorous and thorough approach makes it a reliable starting point for bioinformaticians and biologists. By presenting methodological details of some of the algorithms used in the various components of the data analysis, and coupling these with fully commented examples and exercises, the book presents itself as a must-have for novices and experts alike. Given the fast pace of the field, no book can be exhaustive; however, the wide variety of tools presented here recommends it to a wide audience, in terms of both level of expertise and research focus.
The book consists of seven parts focused on investigating (and data mining) the human genome to answer scientific and medical questions; Mendelian diseases and the use of precision medicine are recurrent themes throughout the chapters. The framework of the book is described in the first chapter. The authors then give an overview of the history of sequencing, from Sanger sequencing to next-generation sequencing and Illumina technologies, all in the light of Moore’s law. The third chapter is a detailed description of Illumina sequencing and covers the steps of library preparation: fragmentation, end repair, adenylation and adapter ligation. The flow cell preparation and the individual steps of sequencing by synthesis are thoroughly presented. In the fourth chapter, whole genome and whole exome sequencing (WGS and WES, respectively) are introduced using as an example the Corpasome, i.e., genomic data from the Corpas family, publicly available since 2012. The step-by-step WES/WGS analysis is presented in detail, including the commands for downloading and processing the data.
The second part of the book is dedicated to raw data processing; it starts with a detailed overview of the FASTQ format, including a description of Phred scores. Next, the authors present several quality checks, such as base quality, nucleotide distribution, GC content distribution, duplication rate and contamination with the sequencing adapter; the interpretation of the \(k\)-mer content and the per-tile sequence quality is also included. Chapter 6 describes the FastQC tool developed at the Babraham Institute. The last chapter in this part revolves around trimming, i.e., the removal of sequencing artefacts, namely sequencing adapters, before the data analysis. The tool presented for this task is Trimmomatic (Java-based). A discussion of the usefulness of trimming and of the usage of other tools such as trimadapt and SAMtools is included.
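As a minimal illustration of the Phred scores mentioned above (a sketch written for this review, not taken from the book): a Phred score \(Q\) encodes an error probability \(P = 10^{-Q/10}\), and FASTQ files store each score as a single ASCII character. The sketch assumes the common offset of 33 (Phred+33); the function names and the example quality string are hypothetical.

```python
def decode_quality_string(qual, offset=33):
    """Decode an ASCII-encoded FASTQ quality string into integer Phred scores.

    The offset of 33 (Phred+33) is an assumption; older data may use 64.
    """
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q):
    """Convert a Phred score Q into the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical quality string for a 5-base read.
scores = decode_quality_string("II5+!")
print(scores)  # [40, 40, 20, 10, 0]

# Per-base probability that the base call is wrong.
error_probs = [phred_to_error_prob(q) for q in scores]
print(error_probs)
```

A score of 20 thus corresponds to a 1% chance of a wrong base call, and a score of 40 to a 0.01% chance.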
The third part of the book focuses on alignment tools; the SAM and BAM formats are introduced and approaches for the quality control of the alignment data are discussed. In Chapter 8, the authors describe the mapping of reads to a reference genome or transcriptome. The examples make use of BWA-MEM. An overview of the human genome reference, with details on the availability of sequences, is presented alongside the Burrows-Wheeler transform used for the mapping. In the ninth chapter, the sequence alignment map (SAM) and the binary alignment map (BAM) are introduced. A full description of the output for single-end and paired-end reads is included; the CIGAR string is presented together with the interpretation of the mapping quality output. The tenth chapter describes the post-processing of alignments using Picard tools; methods for the realignment of reads and for base quality score recalibration are also presented. The last chapter in this part describes the quality control of alignment data in terms of depth and coverage. A detailed description of coverage analysis using the browser extensible data (BED) format is presented; a script to create coverage plots in R is included.
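To illustrate the CIGAR string mentioned above, the following is a minimal sketch (written for this review, not taken from the book) that parses a CIGAR string and computes how many reference and read bases an alignment covers. The operation letters and which of them consume reference or read bases follow the SAM specification; the function names and the example strings are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")

# Operations that advance along the reference and along the read.
REF_CONSUMING = set("MDN=X")
QUERY_CONSUMING = set("MIS=X")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)


# 5 soft-clipped bases, 45 aligned, a 2-base insertion, 48 aligned.
print(parse_cigar("5S45M2I48M"))
print(reference_span("5S45M2I48M"))  # 93
print(query_length("5S45M2I48M"))    # 100
```

Soft-clipped and inserted bases belong to the read but not to the reference, which is why the two totals differ.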
Part 4 covers approaches for variant calling. Chapter 12 focuses on variant calling using GATK, more specifically its HaplotypeCaller module, which is suitable for both single-sample and multi-sample analysis, and BCFtools. The hard filtering option and the variant quality score recalibration (VQSR) are discussed with examples. The chapter concludes with an analysis of the concordance of variant callers. The output of variant calling tools, the variant call format (VCF), is presented at length in Chapter 13. The features of and approaches for variant normalisation are also included. The next chapter presents Jannovar, a stand-alone Java application for the identification of transcripts affected by a given variant. The tool is applicable to variants in either coding or non-coding transcripts and can be used to perform pedigree analyses for the identification of Mendelian disorders. In Chapter 15, the authors present the standards set by the Human Genome Variation Society (HGVS), including the numbering conventions, the annotation of files and the variant categories. Chapter 16 focuses on the quality control of variant calling; it includes a description of the transition-transversion ratio and the proportion of other variants. Chapter 17 presents the Java-based Integrative Genomics Viewer (IGV) for visualising alignments and variants; approaches for recognising poor-quality alignments are described using examples. In Chapter 18, a method for the identification of de novo variants, based on single-sample calling or joint calling, is discussed. In the last chapter, the authors focus on structural variation, including its causes and known categories such as copy number variants, inversions and translocations. Tools such as CoNIFER, CNVnator and DELLY are presented as examples.
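As an illustration of the transition-transversion ratio mentioned above, here is a minimal sketch (written for this review, not taken from the book). Transitions are purine-to-purine (A/G) or pyrimidine-to-pyrimidine (C/T) substitutions; all other single-nucleotide substitutions are transversions. The function names and the input list of (reference, alternate) allele pairs are hypothetical; in practice these pairs would be read from a VCF file.

```python
BASES = {"A", "C", "G", "T"}
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True if ref>alt stays within purines (A/G) or within pyrimidines (C/T)."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snvs):
    """Transition/transversion ratio of (ref, alt) single-nucleotide variants.

    Records that are not simple A/C/G/T substitutions (e.g. indels) are skipped.
    """
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions found; the ratio is undefined")
    return ti / tv


# Hypothetical calls: three transitions, two transversions, one indel (skipped).
calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T"), ("AT", "A")]
print(ti_tv_ratio(calls))  # 1.5
```

A ratio that deviates strongly from the value expected for the sequenced region is commonly taken as a hint that the call set contains many false positives.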
Part 5 focuses on variant filtering. In Chapter 20, the authors present pedigree and linkage analyses, starting with an overview of location sets, pedigree symbols and file types. Analyses of homozygous and heterozygous variants (with examples of X-chromosomal recessive pedigrees) are presented, coupled with the annotation of VCF files using Jannovar. In the next chapter, the authors present rare variant association studies (RVAS), followed in Chapter 22 by variant frequency analysis and its integration with Jannovar. In Chapter 23, the authors discuss the prediction of variant pathogenicity, starting with criteria for the deleteriousness of a variant. The effects on proteins and on RNA and DNA are examined. The tool MutationTaster is presented as an example of pathogenicity prediction.
Part 6 covers approaches for gene prioritization based on random-walk methods and on phenotype analyses. Chapter 24 describes variant prioritization, an algorithm to determine the likelihood that a disease gene is found. The authors integrate functional variation, gene expression and pathway annotation to further evaluate the priority of genes and to determine a diagnosis. In Chapter 25, random walk analysis for the prioritization of genes is introduced. The effect of direct protein-protein interactions on determining disease gene families and the advantage of selecting the shortest path between interacting proteins are also examined. Next, the authors present phenotype analyses, starting with an overview of the Human Phenotype Ontology (HPO), approaches for the interpretation of annotations, and the integration with other disease databases. The semantic similarity of items annotated with ontology terms and the statistical significance of semantic similarity scores are also discussed; the PhenogramViz tool is included as an example. In Chapter 27, the authors present two software suites, Exomiser and Genomiser, that enable an all-inclusive phenotype-driven analysis of whole exome and whole genome sequencing data. The PHIVE algorithms are presented and a full tutorial is included, coupled with the integration with ExomeWalker, described in Chapter 25. The last chapter of this part focuses on the medical interpretation of the results, using examples to highlight the effect of single exon deletions, of mutations in enhancers, and of repeat expansions or structural variations.
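To give a flavour of random-walk gene prioritization, the following is a minimal sketch of a random walk with restart on a small interaction network (written for this review; it is a generic formulation and not necessarily the one used in the book or in ExomeWalker). Starting from known disease genes (the seeds), the walker repeatedly moves to a random neighbour or, with probability `restart`, jumps back to a seed: \(p_{t+1} = (1-r)\,W p_t + r\,p_0\), where \(W\) is the column-normalised adjacency matrix. The toy network, the restart value of 0.7 and all names are assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until p stops changing.

    adjacency: square list-of-lists with non-negative edge weights.
    seeds: indices of the known disease genes.
    Returns one proximity score per node (higher means closer to the seeds).
    """
    n = len(adjacency)
    # Column-normalise so that each column of W is a probability distribution.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    W = [[adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0
          for j in range(n)] for i in range(n)]
    # The walker starts (and restarts) uniformly on the seed nodes.
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if delta < tol:
            break
    return p


# Toy network: a chain of four proteins 0 - 1 - 2 - 3, with protein 0 as the seed.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(chain, seeds=[0])
print(scores)
```

In this toy chain the scores decrease with distance from the seed, so candidate genes can be ranked by their score.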
Part 7 focuses on cancer studies and commences with a short introduction to cancer characteristics, somatic variants in the light of tumour evolution, sample purity, driver mutations and mutational signatures. In Chapter 29, an overview of the basics of tumour biology and its integration with hereditary cancer syndromes is presented; databases frequently used in cancer bioinformatics are also included. Chapter 30 focuses on the analysis of somatic variants in cancer and uses glioblastoma data to illustrate VarScan2, a tool for variant calling based on pileup files produced by SAMtools. In Chapter 31, the authors present approaches for the estimation of tumour purity and clonality using the PurityEst algorithm. Chapter 32 discusses driver mutations and mutational signatures, coupled with their integration into recurrently mutated pathways. The mathematical description is presented in tandem with the SomaticSignatures library in R.
The book is written for biologists, bioinformaticians, computational biologists or computer scientists who would like either to begin studying the computational analysis of human whole-exome and whole-genome sequencing data or to be exposed to alternative analysis approaches. The examples assume that the reader is comfortable with the command line and with compiling and running scripts.


92-02 Research exposition (monographs, survey articles) pertaining to biology
92D10 Genetics and epigenetics
92D20 Protein sequences, DNA sequences
92-04 Software, source code, etc. for problems pertaining to biology