Quantifying alternative splicing from paired-end RNA-sequencing data. (English) Zbl 1454.62388

Ann. Appl. Stat. 8, No. 1, 309-330 (2014); corrigendum ibid. 9, No. 3, 1706-1707 (2015).
Summary: RNA-sequencing has revolutionized biomedical research and, in particular, our ability to study gene alternative splicing. The problem has important implications for human health, as alternative splicing may be involved in malfunctions at the cellular level and multiple diseases. However, the high-dimensional nature of the data and the existence of experimental biases pose serious data analysis challenges. We find that the standard data summaries used to study alternative splicing are severely limited, as they ignore a substantial amount of valuable information. Current data analysis methods are based on such summaries and are hence suboptimal. Further, they have limited flexibility in accounting for technical biases. We propose novel data summaries and a Bayesian modeling framework that overcome these limitations and determine biases in a nonparametric, highly flexible manner. These summaries adapt naturally to the rapid improvements in sequencing technology. We provide efficient point estimates and uncertainty assessments. The approach allows the study of alternative splicing patterns for individual samples and can also be the basis for downstream analyses. We found a severalfold improvement in estimation mean square error compared to popular approaches in simulations, and substantially higher consistency between replicates in experimental data. Our findings indicate the need for adjusting the routine summarization and analysis of alternative splicing RNA-seq studies. We provide a software implementation in the R package casper.


62P10 Applications of statistics to biology and medical sciences; meta analysis
62F15 Bayesian inference
62-08 Computational methods for problems pertaining to statistics

