Empirical Bayes analysis of RNA sequencing experiments with auxiliary information. (English) Zbl 1435.62391

Summary: Finding differentially expressed genes is a common task in high-throughput transcriptome studies. While traditional statistical methods rank the genes by their test statistics alone, we analyze an RNA sequencing dataset using the auxiliary information of gene length and the test statistics from a related microarray study. Given the auxiliary information, we propose a novel nonparametric empirical Bayes procedure to estimate the posterior probability of differential expression for each gene. We demonstrate the advantage of our procedure in extensive simulation studies and a psoriasis RNA sequencing study. The companion R package calm is available at Bioconductor.


62P10 Applications of statistics to biology and medical sciences; meta analysis
62G08 Nonparametric regression and quantile regression
92D20 Protein sequences, DNA sequences


[1] Andreassen, O. A., Djurovic, S., Thompson, W. K., Schork, A. J., Kendler, K. S., O’Donovan, M. C., Rujescu, D., Werge, T., van de Bunt, M. et al. (2013). Improved detection of common variants associated with schizophrenia by leveraging pleiotropy with cardiovascular-disease risk factors. Am. J. Hum. Genet. 92 197-209.
[2] Benidt, S. and Nettleton, D. (2015). SimSeq: A nonparametric approach to simulation of RNA-sequence datasets. Bioinformatics 31 2131-2140.
[3] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289-300. · Zbl 0809.62014 · doi:10.1111/j.2517-6161.1995.tb02031.x
[4] Blanchard, G. and Roquain, É. (2009). Adaptive false discovery rate control under independence and dependence. J. Mach. Learn. Res. 10 2837-2871. · Zbl 1235.62093
[5] Brzyski, D., Peterson, C. B., Sobczyk, P., Candès, E. J., Bogdan, M. and Sabatti, C. (2017). Controlling the rate of GWAS false discoveries. Genetics 205 61-75.
[6] Cai, T. T. and Sun, W. (2009). Simultaneous testing of grouped hypotheses: Finding needles in multiple haystacks. J. Amer. Statist. Assoc. 104 1467-1481. · Zbl 1205.62005 · doi:10.1198/jasa.2009.tm08415
[7] Collado-Torres, L., Nellore, A., Kammers, K., Ellis, S. E., Taub, M. A., Hansen, K. D., Jaffe, A. E., Langmead, B. and Leek, J. T. (2017). Reproducible RNA-seq analysis using recount2. Nat. Biotechnol. 35 319-321.
[8] Craven, P. and Wahba, G. (1978). Smoothing noisy data with spline functions. Numer. Math. 31 377-403. · Zbl 0377.65007 · doi:10.1007/BF01404567
[9] Du, L. and Zhang, C. (2014). Single-index modulated multiple testing. Ann. Statist. 42 30-79. · Zbl 1297.62217 · doi:10.1214/14-AOS1222
[10] Efron, B. (2004). Large-scale simultaneous hypothesis testing: The choice of a null hypothesis. J. Amer. Statist. Assoc. 99 96-104. · Zbl 1089.62502 · doi:10.1198/016214504000000089
[11] Efron, B. (2007). Size, power and false discovery rates. Ann. Statist. 35 1351-1377. · Zbl 1123.62008 · doi:10.1214/009053606000001460
[12] Efron, B. (2008). Simultaneous inference: When should hypothesis testing problems be combined? Ann. Appl. Stat. 2 197-223. · Zbl 1137.62010 · doi:10.1214/07-AOAS141
[13] Efron, B. (2010). Correlated \(z\)-values and the accuracy of large-scale statistical estimates. J. Amer. Statist. Assoc. 105 1042-1055. · Zbl 1390.62139 · doi:10.1198/jasa.2010.tm09129
[14] Efron, B. and Tibshirani, R. (2002). Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol. 23 70-86.
[15] Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151-1160. · Zbl 1073.62511 · doi:10.1198/016214501753382129
[16] Fan, J. and Han, X. (2017). Estimation of the false discovery proportion with unknown dependence. J. R. Stat. Soc. Ser. B. Stat. Methodol. 79 1143-1164. · Zbl 1373.62272 · doi:10.1111/rssb.12204
[17] Fan, J., Han, X. and Gu, W. (2012). Estimating false discovery proportion under arbitrary covariance dependence. J. Amer. Statist. Assoc. 107 1019-1035. · Zbl 1395.62219 · doi:10.1080/01621459.2012.720478
[18] Fan, J. and Yim, T. H. (2004). A crossvalidation method for estimating conditional densities. Biometrika 91 819-834. · Zbl 1078.62032 · doi:10.1093/biomet/91.4.819
[19] Ferkingstad, E., Frigessi, A., Rue, H., Thorleifsson, G. and Kong, A. (2008). Unsupervised empirical Bayesian multiple testing with external covariates. Ann. Appl. Stat. 2 714-735. · Zbl 1400.62258 · doi:10.1214/08-AOAS158
[20] Gagnon-Bartsch, J. A. and Speed, T. P. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics 13 539-552.
[21] Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. Ser. B. Stat. Methodol. 64 499-517. · Zbl 1090.62072 · doi:10.1111/1467-9868.00347
[22] Gudjonsson, J. E., Ding, J., Johnston, A., Tejasvi, T., Guzman, A. M., Nair, R. P., Voorhees, J. J., Abecasis, G. R. and Elder, J. T. (2010). Assessment of the psoriatic transcriptome in a large sample: Additional regulated genes and comparisons with in vitro models. Journal of Investigative Dermatology 130 1829-1840.
[23] Hall, P., Racine, J. and Li, Q. (2004). Cross-validation and the estimation of conditional probability densities. J. Amer. Statist. Assoc. 99 1015-1026. · Zbl 1055.62035 · doi:10.1198/016214504000000548
[24] Hummel, M., Meister, R. and Mansmann, U. (2008). GlobalANCOVA: Exploration and assessment of gene group effects. Bioinformatics 24 78-85.
[25] Ignatiadis, N., Klaus, B., Zaugg, J. B. and Huber, W. (2016). Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nat. Methods 13 577-580.
[26] Jabbari, A., Suárez-Fariñas, M., Dewell, S. and Krueger, J. G. (2012). Transcriptional profiling of psoriasis using RNA-seq reveals previously unidentified differentially expressed genes. Journal of Investigative Dermatology 132 246-249.
[27] Jin, J. (2008). Proportion of non-zero normal means: Universal oracle equivalences and uniformly consistent estimators. J. R. Stat. Soc. Ser. B. Stat. Methodol. 70 461-493. · Zbl 05563355 · doi:10.1111/j.1467-9868.2007.00645.x
[28] Jin, J. and Cai, T. T. (2007). Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons. J. Amer. Statist. Assoc. 102 495-506. · Zbl 1172.62319 · doi:10.1198/016214507000000167
[29] Kukurba, K. R. and Montgomery, S. B. (2015). RNA sequencing and analysis. Cold Spring Harbor Protocols 2015 951-969.
[30] Langaas, M., Lindqvist, B. H. and Ferkingstad, E. (2005). Estimating the proportion of true null hypotheses, with application to DNA microarray data. J. R. Stat. Soc. Ser. B. Stat. Methodol. 67 555-572. · Zbl 1095.62037 · doi:10.1111/j.1467-9868.2005.00515.x
[31] Law, C. W., Chen, Y., Shi, W. and Smyth, G. K. (2014). Voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15 R29.
[32] Leek, J. T. and Storey, J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3 e161.
[33] Li, A. and Barber, R. F. (2017). Accumulation tests for FDR control in ordered hypothesis testing. J. Amer. Statist. Assoc. 112 837-849.
[34] Liang, K. (2019). Supplement to “Empirical Bayes analysis of RNA sequencing experiments with auxiliary information.” DOI:10.1214/19-AOAS1270SUPP. · Zbl 1435.62391
[35] Liang, K. and Nettleton, D. (2012). Adaptive and dynamic adaptive procedures for false discovery rate control and estimation. J. R. Stat. Soc. Ser. B. Stat. Methodol. 74 163-182. · Zbl 1411.62226 · doi:10.1111/j.1467-9868.2011.01001.x
[36] MacDonald, P., Liang, K. and Janssen, A. (2019). Dynamic adaptive procedures that control the false discovery rate. Electron. J. Stat. 13 3009-3024. · Zbl 1429.62334 · doi:10.1214/19-EJS1589
[37] Martin, R. and Tokdar, S. (2012). A nonparametric empirical Bayes framework for large-scale multiple testing. Biostatistics 13 427-439. · Zbl 1244.62066 · doi:10.1093/biostatistics/kxr039
[38] Meinshausen, N. and Rice, J. (2006). Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann. Statist. 34 373-393. · Zbl 1091.62059 · doi:10.1214/009053605000000741
[39] Nestle, F. O., Conrad, C., Tun-Kyi, A., Homey, B., Gombert, M., Boyman, O., Burg, G., Liu, Y.-J. and Gilliet, M. (2005). Plasmacytoid predendritic cells initiate psoriasis through interferon-\(\alpha\) production. J. Exp. Med. 202 135-143.
[40] Newton, M. A. (2002). On a nonparametric recursive estimator of the mixing distribution. Sankhyā 64 306-322. · Zbl 1192.62110
[41] Oshlack, A. and Wakefield, M. J. (2009). Transcript length bias in RNA-seq data confounds systems biology. Biology Direct 4 14.
[42] Parisi, R., Symmons, D. P., Griffiths, C. E., Ashcroft, D. M. et al. (2013). Global epidemiology of psoriasis: A systematic review of incidence and prevalence. Journal of Investigative Dermatology 133 377-385.
[43] Patra, R. K. and Sen, B. (2016). Estimation of a two-component mixture model with applications to multiple testing. J. R. Stat. Soc. Ser. B. Stat. Methodol. 78 869-893. · Zbl 1414.62111 · doi:10.1111/rssb.12148
[44] Qu, L., Nettleton, D. and Dekkers, J. C. M. (2012). A hierarchical semiparametric model for incorporating intergene information for analysis of genomic data. Biometrics 68 1168-1177. · Zbl 1259.62101 · doi:10.1111/j.1541-0420.2012.01778.x
[45] Rosenblatt, M. (1969). Conditional probability density and regression estimators. In Multivariate Analysis, II (Proc. Second Internat. Sympos., Dayton, Ohio, 1968) 25-31. Academic Press, New York.
[46] Schwartzman, A. and Lin, X. (2011). The effect of correlation in false discovery rate estimation. Biometrika 98 199-214. · Zbl 1215.62071 · doi:10.1093/biomet/asq075
[47] Scott, J. G., Kelly, R. C., Smith, M. A., Zhou, P. and Kass, R. E. (2015). False discovery rate regression: An application to neural synchrony detection in primary visual cortex. J. Amer. Statist. Assoc. 110 459-471.
[48] Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability. CRC Press, London. · Zbl 0617.62042
[49] Smyth, G. K. (2004). Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3 Art. 3, 29. · Zbl 1038.62110 · doi:10.2202/1544-6115.1027
[50] Stange, J., Dickhaus, T., Navarro, A. and Schunk, D. (2016). Multiplicity- and dependency-adjusted \(p\)-values for control of the family-wise error rate. Statist. Probab. Lett. 111 32-40. · Zbl 1341.62247 · doi:10.1016/j.spl.2016.01.005
[51] Storey, J. D., Taylor, J. E. and Siegmund, D. (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. J. R. Stat. Soc. Ser. B. Stat. Methodol. 66 187-205. · Zbl 1061.62110 · doi:10.1111/j.1467-9868.2004.00439.x
[52] Sun, W. and Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc. 102 901-912. · Zbl 1469.62318 · doi:10.1198/016214507000000545
[53] Swindell, W. R., Xing, X., Voorhees, J. J., Elder, J. T., Johnston, A. and Gudjonsson, J. E. (2014). Integrative RNA-seq and microarray data analysis reveals GC content and gene length biases in the psoriasis transcriptome. Physiological Genomics 46 533-546.
[54] Tansey, W., Wang, Y., Blei, D. M. and Rabadan, R. (2018). Black box FDR. International Conference on Machine Learning 4874-4883.
[55] Tsoi, L. C., Iyer, M. K., Stuart, P. E., Swindell, W. R., Gudjonsson, J. E., Tejasvi, T., Sarkar, M. K., Li, B., Ding, J. et al. (2015). Analysis of long non-coding RNAs highlights tissue-specific expression patterns and epigenetic profiles in normal and psoriatic skin. Genome Biol. 16 24.
[56] van der Fits, L., van der Wel, L., Laman, J. D., Prens, E. P. and Verschuren, M. C. (2004). In psoriasis lesional skin the type I interferon signaling pathway is activated, whereas interferon-\(\alpha\) sensitivity is unaltered. Journal of Investigative Dermatology 122 51-60.
[57] Wang, J., Zhao, Q., Hastie, T. and Owen, A. B. (2017). Confounder adjustment in multiple hypothesis testing. Ann. Statist. 45 1863-1894. · Zbl 1486.62223 · doi:10.1214/16-AOS1511
[58] Wasserman, L. (2006). All of Nonparametric Statistics. Springer Texts in Statistics. Springer, New York. · Zbl 1099.62029
[59] Wood, S. N. (2003). Thin plate regression splines. J. R. Stat. Soc. Ser. B. Stat. Methodol. 65 95-114. · Zbl 1063.62059 · doi:10.1111/1467-9868.00374
[60] Wood, S. N. (2017). Generalized Additive Models: An Introduction with R. Texts in Statistical Science Series. CRC Press, Boca Raton, FL. · Zbl 1368.62004
[61] Yao, Y., Richman, L., Morehouse, C., De Los Reyes, M., Higgs, B. W., Boutrin, A., White, B., Coyle, A., Krueger, J. et al. (2008). Type I interferon: Potential therapeutic target for psoriasis? PLoS ONE 3 e2737.
[62] Young, D. S. and Hunter, D. R. (2010). Mixtures of regressions with predictor-dependent mixing proportions. Comput. Statist. Data Anal. 54 2253-2266. · Zbl 1284.62467 · doi:10.1016/j.csda.2010.04.002