Statistical genomics. Methods and protocols. (English) Zbl 1346.92003

Methods in Molecular Biology 1418. New York, NY: Humana Press/Springer (ISBN 978-1-4939-3576-5/hbk; 978-1-4939-3578-9/ebook). xi, 418 p. (2016).
This book is a timely compilation of short articles illustrating the state of the art in handling big data collections (such as the Gene Expression Omnibus or The Cancer Genome Atlas) and in obtaining and interpreting gene quantifications. It also contains a review of publicly available and commonly used tools such as NGS-QC, BEDOPS, WEKA and edgeR. The book is structured in four parts covering both the introductory notions (Part 1, groundwork) and advanced topics: public genomic data in Part 2, applications in Part 3, and tools in Part 4.
The first chapter contains an overview of data formats including fastq and fasta for sequencing data, bam and sam for alignments, gff/gtf and bed for feature annotations and vcf for encoding genomic variations. In the second chapter, the authors present the steps of exploratory data analysis on multiple -omics studies, e.g., multiple assays on mRNA, proteins, etc., conducted on the same biological samples. The authors first introduce the analysis of two datasets using a co-inertia approach, which is followed by a case study on NCI-60 cell line transcriptomic and proteomic data. Next, the authors generalize this approach to three or more datasets and provide as an example the comparison of gene expression from four different microarray datasets. In the third chapter, the authors discuss the importance and effect of the design of sequencing studies. Following a description of the rationale and statistical background, the authors evaluate the effect of randomization and of sampling distributions. Next, they analyse the accuracy of the statistical testing, i.e., the power of the experiment to reveal significant differences, and the confounding effect that makes it difficult to determine which factor could be linked to the observed effect. Next, the authors propose alternatives such as pilot experiments (i.e., a restriction on the randomization) and discuss the level of replication. The fourth and last chapter in the groundwork section discusses genomic annotation resources in R and Bioconductor. The authors focus on the AnnotationHub package and exemplify it on R. norvegicus and D. melanogaster data.
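As a minimal illustration of the fastq format mentioned for the first chapter, the sketch below reads four-line records (header, sequence, separator, quality string) and decodes the quality characters. It is not taken from the book; the function name `parse_fastq`, the record type `FastqRecord` and the Phred+33 quality offset are assumptions made here for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: Iterable[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line fastq entries.

    Assumes unwrapped records (one line each for sequence and quality)
    and a Phred+33 quality encoding unless another offset is given.
    """
    # Drop line endings and empty lines; a sketch that holds the input in memory.
    cleaned = [line.rstrip("\r\n") for line in lines if line.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("fastq input must contain a multiple of four lines")
    for i in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], sequence, [ord(c) - offset for c in quality])


if __name__ == "__main__":
    example = ["@read1\n", "ACGT\n", "+\n", "II5!\n"]
    for record in parse_fastq(example):
        print(record.name, record.sequence, record.qualities)
```

Real fastq files are usually compressed and far larger than memory, so a production reader would stream the input rather than collect it in a list.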
Part 2 contains a description of two of the most widely used databases: GEO and The Cancer Genome Atlas (TCGA). In Chapter 5, the authors review the structure and contents of GEO. They show how to perform a quick search using keywords and how to retrieve records. Methods for advanced searches and programmatic searches are also described (e.g., using GEO2R). In the sixth chapter, the authors describe how to retrieve and use TCGA data, together with the data types and data levels, i.e., raw data, processed samples or segmented and interpreted data.
Part 3, Applications, commences with an overview of exploratory analysis of oligonucleotide arrays. Following an overview of array types and applications, the author describes the handling of the data: import, pre-processing, normalization and summarization. The examples cover expression arrays, gene ST arrays and SNP arrays. In the eighth chapter, the authors give an overview of the steps of meta-analysis in gene expression studies, using a high-grade ovarian cancer study as an example. The authors commence with the retrieval of the data from GEO, the curation, pre-processing and evaluation of the batch effects. Next, they present fixed effects and mixed effects data analyses and conclude with a description of potential extensions to predictive modelling derived from the analysis of real data. In Chapter 9, the authors present a practical analysis of genome contact interaction experiments facilitated by recent developments in chromatin conformation capture (3C) coupled with next generation sequencing, which led to Hi-C. Using the diffHic package to exemplify the analysis, the authors present the steps for identifying (i) significant contacts using a model-based approach and (ii) interaction differences between two conditions. In the tenth chapter, the authors describe a quantitative comparison of large-scale DNA enrichment followed by sequencing using DNA-IPseq, a method which has been successfully used for the identification of epigenetic alterations in various diseases. To exemplify the analysis, the authors use the MEDIPS package (in R/Bioconductor). Following a brief overview of parameter settings and quality checks, the differential coverage analysis is described in detail. Fully commented R scripts for the analysis of the H3K4me2 ChIP-seq data conclude the chapter. The eleventh chapter is focused on variant calling from next generation sequencing (NGS) data. The author commences with an overview of statistical methods available for this task and an evaluation of their accuracy. Next, using an example dataset, the individual steps of the analysis are presented. The chapter concludes with approaches for viewing and filtering variants. In the twelfth chapter, the authors focus on the genome-scale analysis of cell-specific regulatory codes obtained using nuclear enzymes, which was made possible by the genome-wide profiling of chromatin features at nucleotide resolution. First, the authors describe the details of the assay used for quantifying chromatin accessibility, including its biases and reproducibility. Next, the resulting data types, their features and available browsers for visualization are discussed. Lastly, the authors focus on the analysis of transcription factor (TF) footprints and on the advantages and limitations of sequence motif analyses.
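To make the fixed effects analysis mentioned for the eighth chapter concrete, the following sketch pools per-study effect estimates by inverse-variance weighting, which is the standard fixed-effect estimator. It is not claimed to be the procedure used in the book; the function name `fixed_effect_meta` and the numbers in the usage example are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fixed_effect_meta(effects: Sequence[float],
                      std_errors: Sequence[float]) -> Tuple[float, float]:
    """Return the inverse-variance weighted pooled effect and its standard error.

    Each study i contributes weight w_i = 1 / se_i**2, so that
    pooled = sum(w_i * effect_i) / sum(w_i) and se_pooled = sqrt(1 / sum(w_i)).
    """
    if len(effects) == 0 or len(effects) != len(std_errors):
        raise ValueError("need one standard error per effect estimate")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


if __name__ == "__main__":
    # Hypothetical log fold changes of one gene in three studies.
    estimate, se = fixed_effect_meta([0.8, 1.1, 0.5], [0.30, 0.45, 0.25])
    print(f"pooled effect = {estimate:.3f}, standard error = {se:.3f}")
```

A fixed-effect model assumes that all studies estimate the same underlying effect; models with a between-study variance component relax that assumption by adding it to each study's variance before weighting.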
In the thirteenth chapter, the authors present NGS-QC Generator, a set of quality check tools designed for ChIP-seq data and applicable to other related datasets such as RNA-seq, GRO-seq, DNase-seq, FAIRE-seq, MNase-seq, Hi-C and ChIA-PET. Following a brief overview of the field and a description of the datasets used as examples, the authors present in detail the quality assessment of the sequencing, conducted using FastQC. Next, the use of the NGS-QC Generator portal via the dedicated Galaxy instance is discussed, with a focus on global quality control analyses and approaches to visualize local enrichment patterns in the context of their quality. The chapter concludes with a description of the generator database.
In the fourteenth chapter, the authors present BEDOPS, a toolkit for querying, analysing and comparing genomic datasets of variable sizes using custom pipelines efficient both in terms of memory usage and runtime (the toolkit is exemplified on ChIP-seq data). First, the authors describe the genomic data formats accepted in BEDOPS; next the pipelining with streams is presented, followed by an overview of the core functionalities. The chapter concludes with a series of suggestions for working efficiently with big data, such as the use of parallelization, sorting and compression.
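The combination of sorted input and streaming that the chapter recommends can be illustrated with a small sketch: when intervals arrive sorted by chromosome and start, overlapping regions can be merged in a single pass with constant memory. This is only an illustration of the idea in Python, not BEDOPS code; the function name `merge_sorted` and the choice to merge adjoining intervals as well as overlapping ones are assumptions made here.

```python
from typing import Iterable, Iterator, Optional, Tuple

# (chromosome, start, end) with 0-based, half-open coordinates as in the bed format.
Interval = Tuple[str, int, int]


def merge_sorted(intervals: Iterable[Interval]) -> Iterator[Interval]:
    """Merge overlapping or adjoining intervals from a sorted stream.

    The input must be sorted by chromosome and then by start coordinate;
    only the interval currently being extended is kept in memory.
    """
    current: Optional[Interval] = None
    for chrom, start, end in intervals:
        if current is None:
            current = (chrom, start, end)
        elif chrom == current[0] and start <= current[2]:
            # Same chromosome and no gap: extend the current interval.
            current = (chrom, current[1], max(current[2], end))
        else:
            yield current
            current = (chrom, start, end)
    if current is not None:
        yield current


if __name__ == "__main__":
    peaks = [("chr1", 0, 10), ("chr1", 5, 20), ("chr1", 40, 50), ("chr2", 0, 5)]
    for chrom, start, end in merge_sorted(peaks):
        print(chrom, start, end, sep="\t")
```

Because the function is a generator, it can be chained with other single-pass steps in the manner of a pipeline, and the sorted-input requirement is what removes the need to hold the whole dataset in memory.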
In the fifteenth chapter, the authors describe GMAP and GSNAP, efficient tools in terms of speed, accuracy and functionality for aligning RNA-seq and DNA-seq outputs to the reference genome. The chapter is focused on the significant enhancement obtained by representing the genome using single-instruction multiple-data (SIMD) instructions, compressed genome hash tables and enhanced suffix arrays (ESAs). The chapter commences with an overview of the algorithms used in these tools and a brief history of their development. Next, the computational features are discussed, including the tolerance to SNPs and applicability to various sequencing data including bisulfite and PAR-CLIP sequencing. The handling of chimeric alignments, distant splicing and alternate scaffolds is also presented.
In the sixteenth chapter, the authors present the Gviz package in R/Bioconductor, a flexible framework for visualizing genomic data when multiple annotation features are available. The authors first present technical details on the installation of the library and briefly describe the D. melanogaster data used to exemplify Gviz. Next, the tracking of objects and adjustment of display parameters is discussed, first theoretically and then as a full hands-on example.
The seventeenth chapter is focused on the machine-learning platform WEKA. Following a general introduction to the choices facing data mining scientists, such as the use of structured or unstructured data or of supervised or unsupervised learning, the authors present a step-by-step description (with examples) of the tool itself. A section is also included on how to approach and handle a bioinformatics problem.
In the eighteenth chapter, the authors discuss a crucial component of the experimental design: the power calculation. The theoretical aspects and hands-on examples are based on the R/Bioconductor package PROPER. The calculations are based on several assumptions including the amplitude of the effect size (translated as the differential expression between treatments), the within-group variation, the acceptable number of type I errors and the sample size. Next, the authors present a step-by-step analysis in PROPER, commenting on each step and on technical decisions (such as the choice between more samples or deeper sequencing).
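How the four ingredients listed above (effect size, within-group variation, type I error rate and sample size) combine into a power value can be seen in the simplest textbook setting, a two-sided two-sample z-test with known standard deviation. The sketch below uses that normal approximation only; it is not the method implemented in PROPER, and the function name `two_sample_power` and the numbers in the usage loop are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size: float, sd: float, n_per_group: int,
                     alpha: float = 0.05) -> float:
    """Power of a two-sided two-sample z-test with equal group sizes.

    effect_size is the true difference in group means, sd the common
    within-group standard deviation (assumed known), alpha the type I
    error rate.
    """
    if sd <= 0 or n_per_group <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("sd and n_per_group must be positive, alpha in (0, 1)")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Standardized shift of the test statistic under the alternative.
    shift = abs(effect_size) / (sd * sqrt(2.0 / n_per_group))
    # Probability of landing in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, round(two_sample_power(effect_size=1.0, sd=1.0, n_per_group=n), 3))
```

Count data from sequencing experiments do not satisfy these assumptions, and each experiment tests many genes at once, which is why dedicated tools are used in practice. The qualitative trade-off is the same: power increases with effect size and sample size and decreases with within-group variation.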
The nineteenth chapter offers a recipe for detecting differentially expressed transcripts in RNA-seq experiments using edgeR. Following an overview of the basic theory for this type of analysis, the authors introduce the modelling of variability using a quasi-negative binomial framework and the normalization for composition biases. Next, they exemplify this approach on real data (M. musculus), with a detailed description of every step of the analysis. The chapter concludes with details on the advanced usage of edgeR, including the handling of complicated contrasts and gene ontology enrichment analyses.
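The composition bias mentioned above can be demonstrated with a toy example: if one gene is very highly expressed in one sample only, scaling by total library size makes all other genes in that sample appear down-regulated even though their counts are unchanged. The sketch below shows only the problem, using plain counts-per-million scaling; it does not implement the normalization used by edgeR, and the function name and the counts are hypothetical.

```python
from typing import List, Sequence


def counts_per_million(counts: Sequence[int]) -> List[float]:
    """Scale raw read counts by the total library size (counts per million)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("library size must be positive")
    return [c * 1_000_000 / total for c in counts]


if __name__ == "__main__":
    # Hypothetical counts for four genes; the first three genes have the
    # same counts in both samples, the fourth is expressed in sample B only.
    sample_a = [100, 200, 700, 0]
    sample_b = [100, 200, 700, 1000]
    cpm_a = counts_per_million(sample_a)
    cpm_b = counts_per_million(sample_b)
    for gene, (a, b) in enumerate(zip(cpm_a, cpm_b), start=1):
        print(f"gene{gene}: CPM A = {a:.0f}, CPM B = {b:.0f}")
```

After library-size scaling, the first three genes have half the value in sample B that they have in sample A, purely because the fourth gene takes up half of that library. Normalization for composition bias is meant to correct this kind of artefact.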
This collection of articles offers a thorough overview of the field, making it an opportune and useful addition to the literature. The book is written in accessible language, and the variety of topics presented recommends it as an excellent starting point or up-to-date reference for the field. It is suitable for both post-graduate and established researchers, and the numerous examples that accompany the discussed topics make it a valuable asset.


92-02 Research exposition (monographs, survey articles) pertaining to biology
92B15 General biostatistics
92D10 Genetics and epigenetics
62P10 Applications of statistics to biology and medical sciences; meta analysis
62B15 Theory of statistical experiments
62F15 Bayesian inference
62K05 Optimal statistical designs