Computational genomics with R. With the assistance of Vedran Franke, Bora Uyar and Jonathan Ronen. (English) Zbl 1482.92002

Chapman & Hall/CRC Computational Biology Series. Boca Raton, FL: CRC Press (ISBN 978-1-4987-8185-5/hbk; 978-0-429-08431-7/ebook). xxii, 440 p. (2021).
The book covers a broad range of fundamental notions that form the basis of modern genomics analyses. It is structured in 11 chapters, grouped into three thematic parts: an overview of genomics together with an introduction to R in that context, a summary of statistics and machine learning approaches, and an illustration of the analysis steps for RNAseq, epigenetics and multi-omics pipelines.
The book commences with two introductory chapters, the first on genomics concepts and the second on R as a programming environment. The former gives an overview of the central dogma of molecular biology, underlining key concepts such as genome, gene, and transcriptional and post-transcriptional regulation. High-throughput experimental approaches for quantifying the various modalities are also introduced. The latter chapter starts with an outline of the steps of a data analysis and an argument in favour of using R as the analysis environment. Basic concepts such as data structures and data types, and the input and output of data (including visualisation approaches), are presented.
The second part of the book focuses on statistics and machine learning approaches; it spans three chapters. Chapter 3 gives an overview of statistical tests for assessing differences between samples (distributions). Summary statistics such as the mean, median, variation and confidence intervals are backed up with details on t-tests and multiple testing corrections; linear models and correlations are also included. In Chapter 4 the author focuses on unsupervised learning and presents details on clustering approaches (distance metrics, hierarchical clustering and k-means clustering) and dimensionality reduction (principal component analysis, multi-dimensional scaling and t-distributed stochastic neighbour embedding, t-SNE). Chapter 5 is built on supervised machine learning models, illustrated using a case study on disease subtypes derived from genomics data. The standard steps of data pre-processing (transformations, filtering, scaling and handling of missing values) are followed by the concepts of cross-validation and bootstrapping. Additional details on parameter tuning, class imbalance and dealing with correlated predictors are also included. In terms of models, decision trees and random forests are presented and contrasted with logistic regression. Also mentioned are gradient boosting, support vector machines and ensemble learning, with brief details on neural networks and deep learning.
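To give a flavour of the multiple testing corrections mentioned for the statistics chapter, the following is a minimal sketch of the Benjamini-Hochberg adjustment. It is not taken from the book, which works in R; Python is used here only for illustration, and the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values, e.g. one per tested gene.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
adjusted = benjamini_hochberg(p_values)
# Tests whose adjusted value falls below the chosen false discovery rate.
significant = [q < 0.05 for q in adjusted]
```

With a false discovery rate of 0.05, only the first two hypothetical tests remain significant after adjustment, although four of the raw p-values lie below 0.05.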
The third part of the book focuses on approaches for handling high-throughput datasets. In Chapter 6 the author gives an overview of the functionality of the GenomicRanges package in R; also discussed are methods for summarising expression and visualising the outputs (karyograms and circos plots).
The RNAseq overview spans Chapters 7 and 8. In the former, the focus lies on quality checking, pre-processing and aligning reads; following an overview of the fastq and fasta formats, approaches for assessing read quality are discussed (quality scores, nucleotide composition). Options for read filtering and trimming are also included, alongside alternatives for mapping the reads to reference genomes/transcriptomes. In Chapter 8 the author gives an overview of the standard steps of quantifying gene expression, normalisation and assessment of differential expression. The chapter concludes with enrichment analyses and approaches for accounting for other sources of variation within the measurements.
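The read quality measures mentioned above (quality scores, nucleotide composition) can be sketched as follows. This is an illustration only and not the book's R code: it assumes the Phred+33 quality encoding, and the function names and the fastq record are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a fastq quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probabilities(scores):
    """Convert Phred scores Q into base-call error probabilities 10^(-Q/10)."""
    return [10 ** (-q / 10) for q in scores]


def gc_fraction(sequence):
    """Fraction of G and C bases, a simple nucleotide composition summary."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical four-line fastq record: header, sequence, separator, qualities.
record = ["@read_1", "GATTACAGGC", "+", "IIIIIHHH#5"]
scores = phred_scores(record[3])
mean_quality = sum(scores) / len(scores)
gc = gc_fraction(record[1])
```

In this made-up read the ninth base has a Phred score of 2, the kind of low-quality position that filtering or trimming would target.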
The epigenetic high-throughput experiments are discussed in Chapters 9 and 10 (corresponding to ChIPseq and bisulphite sequencing, respectively). In Chapter 9 protein/DNA interactions and sources of experimental bias (antibody specificity, PCR amplification, sequencing depth) are introduced first. Next, the pre-processing and quality control steps are presented, with an emphasis on the side effects of these biases (visualised in genome browsers, and quantified as across-strand cross-correlation and GC bias). Also included are approaches for calling narrow and broad peaks, and the subsequent biological interpretation via motif discovery. Chapter 10 presents in detail the experimental characteristics of bisulphite sequencing. The particularities of the methylation files (and calls) precede a pipeline for differential methylation and segmentation.
The book concludes with a chapter on integrating multiple high-throughput datasets (a multi-omics dataset from colorectal cancer serves as the example). An overview of latent variable models for multi-omics integration is followed by a summary of matrix factorisation approaches for unsupervised integration; theoretical aspects of multiple factor analysis and joint non-negative matrix factorisation are backed up with details of the iCluster approach. The unsupervised methods are further expanded with an outline of one-hot clustering and an illustration of k-means clustering in this context. The chapter ends with proposed biological interpretations of the latent factors (assessment of loading vectors, enrichment analyses and the role of additional covariates).
All chapters include exercises that reinforce the concepts presented and link them to examples and case studies. The book is thoroughly referenced, which recommends it to a wide audience and makes it an excellent starting point for novices and experienced users alike.


92-02 Research exposition (monographs, survey articles) pertaining to biology
92D10 Genetics and epigenetics
62P10 Applications of statistics to biology and medical sciences; meta analysis