zbMATH — the first resource for mathematics

Removing technical variability in RNA-seq data using conditional quantile normalization. (English) Zbl 1437.62486
Summary: The ability to measure gene expression on a genome-wide scale is one of the most promising accomplishments in molecular biology. Microarrays, the technology that first permitted this, were riddled with problems due to unwanted sources of variability. Many of these problems are now mitigated, after a decade’s worth of statistical methodology development. The recently developed RNA sequencing (RNA-seq) technology has generated much excitement in part due to claims of reduced variability in comparison to microarrays. However, we show that RNA-seq data demonstrate unwanted and obscuring variability similar to what was first observed in microarrays. In particular, we find guanine-cytosine content (GC-content) has a strong sample-specific effect on gene expression measurements that, if left uncorrected, leads to false positives in downstream results. We also report on commonly observed data distortions that demonstrate the need for data normalization. Here, we describe a statistical methodology that improves precision by 42% without loss of accuracy. Our resulting conditional quantile normalization algorithm combines robust generalized regression to remove systematic bias introduced by deterministic features such as GC-content and quantile normalization to correct for global distortions.
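The two ingredients named in the summary, removal of a sample-specific GC-content trend and quantile normalization, can be illustrated with a minimal Python sketch. This is not the authors' conditional quantile normalization algorithm: the summary specifies robust generalized regression, whereas the sketch below swaps in an ordinary least-squares polynomial fit for brevity. The function names, the `degree` parameter, and the assumed input layout (a genes-by-samples matrix of log-scale expression values plus one GC-content value per gene) are all hypothetical choices made for illustration.

```python
import numpy as np


def remove_gc_effect(log_expr, gc, degree=2):
    """Subtract a per-sample polynomial trend in GC-content.

    Assumption: an ordinary least-squares polynomial stands in for the
    robust regression described in the summary. Each sample (column)
    keeps its original mean.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    out = np.empty_like(log_expr)
    for j in range(log_expr.shape[1]):
        # Fit the GC trend separately for every sample, because the
        # effect is described as sample-specific.
        coeffs = np.polyfit(gc, log_expr[:, j], degree)
        trend = np.polyval(coeffs, gc)
        out[:, j] = log_expr[:, j] - trend + trend.mean()
    return out


def quantile_normalize(x):
    """Give every sample (column) the same empirical distribution.

    The shared reference distribution is the mean of the sorted columns.
    Ties are broken by position rather than averaged, which is a
    simplification.
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)
    ranks = np.argsort(order, axis=0)  # rank of each entry within its column
    reference = np.sort(x, axis=0).mean(axis=1)
    return reference[ranks]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gc = rng.uniform(0.3, 0.7, size=200)        # hypothetical GC fractions
    signal = rng.normal(5.0, 1.0, size=(200, 1))
    slopes = np.array([2.0, -1.0, 0.5])         # one GC slope per sample
    log_expr = signal + gc[:, None] * slopes
    normalized = quantile_normalize(remove_gc_effect(log_expr, gc))
    print(normalized.shape)
```

Fitting the trend per sample reflects the summary's point that the GC-content effect differs between samples. A single global correction would leave those between-sample differences in place.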

MSC:
62P10 Applications of statistics to biology and medical sciences; meta analysis
Software:
oligo; gcrma; edgeR