
Statistical analysis of next generation sequencing data. (English) Zbl 1296.92001

Frontiers in Probability and the Statistical Sciences. Cham: Springer (ISBN 978-3-319-07211-1/hbk; 978-3-319-07212-8/ebook). xiv, 432 p. (2014).
This book is an excellent collection of 20 chapters presenting the state of the art (as of 2014) of algorithms developed for the analysis of next generation sequencing (NGS) data.
The book commences (Susmita Datta, Somnath Datta, Ryan Gill, Riten Mitra, “Statistical analyses of next generation sequencing data: an overview”, pp. 1–24) with an overview of commonly used sequencing technologies and their various applications. Starting with a description of DNA, seen as the unit of sequencing, the chapter continues with a presentation of the steps for library preparation, amplification, tagging and sequencing. Next, the downstream applications are reviewed, including de novo assembly and expression quantification. The next section presents in more detail the input and output of NGS platforms such as SOLiD, Illumina/Solexa, Ion Semiconductor and single-molecule real-time sequencing. It is followed by an overview of statistical tools for processing the resulting reads: data quality and reproducibility tests, base calling, and alignment and assembly tools. The chapter concludes with some R and Bioconductor packages that can be used for data processing.
The second chapter (Susmita Datta, Ryan S. Gill, Douglas J. Lorenz, Ritendranath Mitra, “Using RNA-seq data to detect differentially expressed genes”, pp. 25–49) focuses on one type of NGS data, RNAseq, and, following a brief description of normalization methods, reviews the methods developed for the identification of differentially expressed (DE) genes. First, simple approaches such as the likelihood ratio test (LRT), the Fisher exact test or \(t\)-tests of maximum likelihood estimates (MLEs) are presented in tandem with the R/Bioconductor packages that use them, such as DEGseq. Next, tests based on extensions of the Poisson distribution are reviewed; methods such as the two-stage Poisson model (TSPM) and those based on an adaptive histogram estimator of the empirical Bayes probability of no differential expression and no over-dispersion are discussed in detail. The section concludes with quasi-likelihood tests based on negative binomial distributions, non-parametric approaches such as a modified Wilcoxon test or a Markov random field approach, and Bayesian and empirical Bayes approaches. The following section is dedicated to an overview of R packages for DE analysis of RNAseq data and includes details on GPseq, DEGseq, edgeR, SAMseq, BBSeq and baySeq. The chapter concludes with comparisons of these methods for detecting DE, drawn either from published reviews or from the authors’ own data. The authors underline a conclusion reached in various studies: these methods try to control the false discovery rate and the type I error, but still cope poorly with the known problems arising from the variability of the data and the small number of replicates.
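To make the simplest of these tests concrete, the following is a minimal sketch of a two-sided Fisher exact test for a single gene, assuming a 2×2 table whose first row holds the reads of the gene in conditions A and B and whose second row holds the reads of all other genes. The function names and the table layout are illustrative assumptions of this review, not the implementation of DEGseq or of any package discussed in the chapter.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical, for illustration only):
      a, b = reads of the gene of interest in conditions A and B
      c, d = reads of all other genes in conditions A and B
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def log_prob(x):
        # Hypergeometric log-probability of x gene reads falling in condition A.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    log_obs = log_prob(a)
    # Sum the probabilities of all tables at most as likely as the observed one;
    # the small tolerance absorbs floating-point noise between tied tables.
    p = sum(exp(log_prob(x)) for x in range(lo, hi + 1)
            if log_prob(x) <= log_obs + 1e-7)
    return min(1.0, p)
```

In a genome-wide analysis such a test would be run once per gene, and the resulting p-values would still need a multiple-testing adjustment to control the false discovery rate.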
The third chapter (Yunshun Chen, Aaron T. L. Lun, Gordon K. Smyth, “Differential expression analysis of complex RNA-seq experiments using edgeR”, pp. 51–74) presents the underlying details of edgeR. It commences with a section on the use of the negative binomial model, which discusses in detail the summarization of gene abundances in a count matrix and a way of distinguishing technical from biological variation, and shows how generalized linear models can be used to accommodate complex experimental designs with multiple explanatory factors. The next section focuses on empirical Bayes estimation of the dispersion. The Cox-Reid adjusted profile likelihood is presented, followed by a weighted likelihood empirical Bayes method. The chapter concludes with a case study on the regulation of the transcriptional program by IRF4. The experimental design and all intermediary steps (genome alignment, gene expression estimation, filtering, normalization, data exploration and DE analysis) are presented in detail.
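As a reminder of why the negative binomial model separates the two sources of variation, the standard mean-variance relation can be written as follows. The notation is an assumption of this review and need not match the chapter's: \(y_{gi}\) is the read count of gene \(g\) in sample \(i\), \(\mu_{gi}\) its mean and \(\phi_g\) the dispersion of gene \(g\).

```latex
\[
  \operatorname{E}(y_{gi}) = \mu_{gi},
  \qquad
  \operatorname{Var}(y_{gi}) = \mu_{gi} + \phi_g \, \mu_{gi}^{2},
  \qquad\text{so that}\qquad
  \frac{\operatorname{Var}(y_{gi})}{\mu_{gi}^{2}} = \frac{1}{\mu_{gi}} + \phi_g .
\]
```

The term \(1/\mu_{gi}\) is the Poisson (technical) contribution to the squared coefficient of variation and vanishes as sequencing depth grows, whereas \(\phi_g\) remains; its square root is usually read as the biological coefficient of variation.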
The fourth chapter (Andrea Riebler, Mark D. Robinson, Mark A. van de Wiel, “Analysis of next generation sequencing data using integrated nested Laplace approximation (INLA)”, pp. 75–91) presents a method to analyze NGS data using integrated nested Laplace approximation (INLA) and its corresponding R package r-inla. The chapter commences with the theoretical background for the deterministic framework for Bayesian inference in latent Gaussian models, followed by a description of the main functions in the r-inla package. Next, the authors show how to combine INLA with empirical Bayes and introduce the Bayesian multivariate shrinkage. The chapter concludes with a proof that multivariate shrinkage can improve feature selection at a given FDR and with a case study on an RNAseq analysis of lymphoblastoid cell lines.
The fifth chapter (Dan Nettleton, “Statistical analysis of next generation sequencing data”, pp. 93–113) discusses strategies for the design of RNAseq experiments aimed at increasing the biological relevance of experiments. The author starts with an overview of replication strategies and continues with a discussion of the tradeoff between sequencing depth and number of replicates. The chapter concludes with three examples of experimental designs that were applied to biological problems: an experiment with four treatments, a split-plot experiment and a balanced incomplete block design.
The sixth chapter (Leonardo Collado Torres, Alyssa C. Frazee, Andrew E. Jaffe, Ben Langmead, Jeffrey T. Leek, “Measurement, summary, and methodological variation in RNA-sequencing”, pp. 115–128) is centered on the gene expression level and discusses issues in the measurement and summarization of the intrinsic variation in RNAseq data. The authors commence with an analysis of this variation, splitting it into variation across groups, measurement error and biological variation; each of these sources is then discussed in detail. The authors continue with the variability introduced by summarization methods: the differentiation between spliced and non-spliced isoforms is discussed, and alternatives such as feature summarization of genome-matching reads are also presented. Next, the different answers provided by statistical tests are debated, and the chapter concludes with a list of open problems.
The seventh chapter (Peter J. Bickel, Nathan Boley, James Bentley Brown, Haiyan Huang, Hao Xiong, “DE-FPCA: testing gene differential expression and exon usage through functional principal component analysis”, pp. 129–143) is dedicated to yet another method for the identification of DE genes and variable exon usage which is based on functional principal component analysis. The authors start with the theoretical background of the approach followed by an example on a set of samples from fly heads sequenced paired-end on Illumina IIx and HiSeq2000. On this example, the authors also discuss the robustness of the approach.
In the eighth chapter (Yijuan Hu, Wei Sun, “Mapping of expression quantitative trait loci using RNA-seq data”, pp. 145–168) the authors discuss the mapping of expression quantitative trait loci (eQTLs) using RNAseq data, focusing on allele-specific expression (ASE) and isoform-specific expression (ISE). The chapter commences with the ASE issue: haplotype phasing, sequence mapping bias and the expected allele-specific read count (ASReC) are discussed in detail. Next, the use of ASE for cis-eQTL mapping is presented, followed by isoform-specific eQTL mapping, which includes transcriptome reconstruction and calculation of effective length. The chapter concludes with a discussion of quality control and the effect of possible non-genetic factors, as well as the genetic architecture of gene expression.
The ninth chapter (Sandrine Dudoit, John Ngai, Davide Risso, Terence P. Speed, “The role of spike-in standards in the normalization of RNA-seq”, pp. 169–190) discusses differential expression from the normalization perspective. The authors present in detail spike-in-based normalization as a potential solution for highly variable read counts. The approach is described on a zebrafish dataset. The chapter commences with an overview of normalization methods, which include global scaling normalizations, non-linear normalization and normalization based on the abundances of control sequences. Next, a general framework leading to DE analysis is introduced, together with the respective R packages: affy, DESeq, EDASeq, edgeR and RUVSeq. The efficiency of the method is then demonstrated on a dataset, and the impact of the normalization on the DE calls is discussed in detail.
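The simplest member of the global scaling family mentioned above divides every count in a sample by that sample's total read count. The sketch below illustrates this with counts per million; the function name and the data layout are hypothetical, and this is neither the spike-in method of the chapter nor the code of any of the listed packages.

```python
def counts_per_million(sample_counts):
    """Scale one sample's per-gene read counts to counts per million (CPM).

    sample_counts: list of non-negative read counts, one entry per gene.
    The single scaling factor is the library size (total read count), which is
    the defining feature of a global scaling normalization.
    """
    library_size = sum(sample_counts)
    if library_size == 0:
        raise ValueError("sample has no reads; the scaling factor is undefined")
    return [count * 1e6 / library_size for count in sample_counts]


# Two hypothetical samples with the same composition but a tenfold depth difference.
shallow = [10, 20, 70]
deep = [100, 200, 700]
normalized = [counts_per_million(s) for s in (shallow, deep)]
```

After scaling, the two samples have identical profiles, which is the intended effect when depth is the only difference. A single factor per sample cannot correct sample-specific distortions of individual genes, which is the kind of situation that motivates the non-linear and control-based alternatives.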
The tenth chapter (Peng Liu, Yaqing Si, “Cluster analysis of RNA-sequencing data”, pp. 191–217) presents cluster analysis adapted to the characteristics of RNAseq data. Following the description of two discrete distributions proposed for the modeling of RNAseq data, the Poisson and the negative binomial, the authors review standard dissimilarity measures; methods such as \(k\)-means and hierarchical clustering, MB-EM and hybrid hierarchical clustering are then discussed in detail. The chapter concludes with a series of case studies that also provide a comparison of the methods and implementation details.
The eleventh chapter (Ashley Petersen, Kean Ming Tan, Daniela Witten, “Classification of RNA-seq data”, pp. 219–246) discusses the classification issue in the RNAseq context. It consists of descriptions of frequently used classification methods like linear regression, linear discriminant analysis in low and high dimensions, principal component classification, partial least squares for classification and support vector machines (SVMs). These are followed by a brief presentation of normalization methods and their effect on classification. The chapter concludes with the evaluation of the classification approaches and a series of case studies on prostate cancer and cervical cancer.
In the twelfth chapter (Hongzhe Li, “Isoform expression analysis based on RNA-seq data”, pp. 247–259) the author investigates the quantification of isoforms using RNAseq data. The chapter commences with a section based on the assumption that the isoforms are known which includes approaches taking into account non-uniform sampling and methods for the simultaneous discovery and quantification of isoforms. The chapter concludes with a section on allele specific transcript quantification.
The thirteenth chapter (Julia Salzman, “RNA isoform discovery through goodness of fit diagnostics”, pp. 261–276) focuses on RNA isoform discovery through goodness of fit diagnostics. Following introductory aspects of the biological and statistical background, the author introduces the Poisson model framework and discusses isoform detection when mismatches are allowed. Next, model selection via the sampling rate matrix is introduced, and the author discusses in detail the modeling of alignment quality, the residual analysis and the detection of lack of fit.
The fourteenth chapter (Dongjun Chung, Sündüz Keleş, Qi Zhang, “MOSAiCS-HMM: a model-based approach for detecting regions of histone modifications from ChIP-Seq data”, pp. 277–295) introduces a new type of NGS data, the ChIPseq and a model-based approach for the detection of regions showing histone modifications. Following an introduction offering a description of the typical workflow of statistical analysis of ChIPseq experiments, the authors present the MOSAiCS-HMM model and its parameter estimations. The chapter concludes with a case study of H3K4me3 profiling in GM12878 cells.
The fifteenth chapter (Riten Mitra, Peter Müller, “Hierarchical Bayesian models for ChIP-seq data”, pp. 297–314) presents hierarchical Bayesian models also on ChIPseq data. The authors start with the description of conditional independence structure using a graphical model, followed by a bi-clustering method for the understanding of the histone code. The case study consists of a biological example of co-clustering of ChIPseq data.
The sixteenth chapter (Kui Zhang, Degui Zhi, “Genotype calling and haplotype phasing from next generation sequencing data”, pp. 315–333) discusses genotype calling and haplotype phasing. Following a description of the overall pipeline analysis and the introduction of basic notations, the authors continue with a discussion of single-site genotype likelihood. Next, multisample calling is presented and the joint likelihood, maximum likelihood estimation, estimation of the number of non-reference alleles and variant detection are presented in detail. The next section is on multi-site multi-sample methods for which the HMM model HapSeq is described in detail.
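For readers unfamiliar with single-site genotype likelihoods, the idea can be sketched as follows under strong simplifying assumptions that are this review's and not necessarily the chapter's: a diallelic site, independent reads, and one uniform base-error rate. The function name and the default error rate are hypothetical.

```python
from math import log


def genotype_log_likelihoods(n_ref, n_alt, error_rate=0.01):
    """Log-likelihoods of the genotypes carrying 0, 1 or 2 alternate alleles.

    n_ref, n_alt: numbers of reads showing the reference / alternate base.
    error_rate:   assumed uniform probability that a read shows the wrong base.
    The binomial coefficient is omitted because it is the same for all three
    genotypes and does not affect their comparison.
    """
    log_liks = []
    for g in (0, 1, 2):
        # A read shows the alternate base if it samples an alternate allele and
        # is read correctly, or samples a reference allele and is misread.
        p_alt = (g / 2) * (1 - error_rate) + (1 - g / 2) * error_rate
        log_liks.append(n_ref * log(1 - p_alt) + n_alt * log(p_alt))
    return log_liks


def call_genotype(n_ref, n_alt, error_rate=0.01):
    """Return the number of alternate alleles with the highest likelihood."""
    log_liks = genotype_log_likelihoods(n_ref, n_alt, error_rate)
    return max(range(3), key=lambda g: log_liks[g])
```

Multi-sample and multi-site methods improve on such per-site calls by sharing information across individuals and along haplotypes, which matters most at low coverage where the three likelihoods are close.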
The seventeenth chapter (Ruofei Du, Zhide Fang, “Analysis of metagenomic data”, pp. 335–353) focuses on the analysis of metagenomic data. Commencing with a brief overview of metagenomic studies, the authors present next some statistical analyses conducted on this type of data. The sufficiency of sample size, the metagenomic binning and the assessment of accuracy are described in detail. Methods for the adjustment of the resulting profiles to allow comparisons are presented next, e.g., beta binomial approach, overdispersed logistic regression approach, overdispersed log-linear regression approach and nonparametric \(t\)-test.
The eighteenth chapter (Venkatraman E. Seshan, “Detecting copy number changes and structural rearrangements using DNA sequencing”, pp. 355–378) focuses on the detection of copy number changes and structural rearrangements. The method presented is circular binary segmentation and its use is shown on an example for breast cancer cell line HCC1143.
The nineteenth chapter (Mengjie Chen, Lin Hou, Hongyu Zhao, “Statistical methods for the analysis of next generation sequencing data from paired tumor-normal samples”, pp. 379–404) reviews statistical approaches for the analysis of paired data (e.g. tumor-normal samples). For the detection of single nucleotide aberrations both a heuristic method and a statistical method based on a Bayesian framework are presented. The detection of copy number aberrations (CNAs) is described in connection with the GC content and the mappability issue. CNA identification by change-point detection methods is exemplified using seqCBS and BICseq. These approaches are then discussed on a case study of a TCGA benchmark dataset.
The twentieth chapter (Debashis Ghosh, Santhosh Girirajan, “Statistical considerations in the analysis of rare variants”, pp. 405–422) discusses the analysis of rare variants using kernel machine methodology. Following a description of the theoretical background, the authors discuss the SKAT example and the benefits of multiple testing.
This book is a valuable and well-timed collection of articles on the statistical methods that can be applied to NGS data. Although no prior NGS knowledge is required, the book is addressed mainly to researchers at postgraduate and post-doc levels.

MSC:

92-01 Introductory exposition (textbooks, tutorial papers, etc.) pertaining to biology
92-02 Research exposition (monographs, survey articles) pertaining to biology
92-06 Proceedings, conferences, collections, etc. pertaining to biology
92-08 Computational methods for problems pertaining to biology
92C40 Biochemistry, molecular biology
92D20 Protein sequences, DNA sequences
62P10 Applications of statistics to biology and medical sciences; meta analysis