zbMATH — the first resource for mathematics

Measuring reproducibility of high-throughput experiments. (English) Zbl 1231.62124
Summary: Reproducibility is essential to reliable scientific discovery in high-throughput experiments. We propose a unified approach to measure the reproducibility of findings identified from replicate experiments and to identify putative discoveries using reproducibility. Unlike the usual scalar measures of reproducibility, our approach creates a curve, which quantitatively assesses when the findings are no longer consistent across replicates. Our curve is fitted by a copula mixture model, from which we derive a quantitative reproducibility score, which we call the “irreproducible discovery rate” (IDR), analogous to the false discovery rate (FDR). This score can be computed at each set of paired replicate ranks and permits the principled setting of thresholds both for assessing reproducibility and combining replicates.
Since our approach permits an arbitrary scale for each replicate, it provides useful descriptive measures in a wide variety of situations to be explored. We study the performance of the algorithm using simulations and give a heuristic analysis of its theoretical properties. We demonstrate the effectiveness of our method in a ChIP-seq experiment.
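The analogy between the IDR and the FDR can be illustrated with a small sketch. It is not the authors' implementation and does not use the API of the idr package. It assumes that a two-component mixture has already been fitted by some means, and that each pair of replicate measurements comes with a posterior probability of belonging to the irreproducible component. The name `local_idr`, the function `idr_from_local`, and the example values are all hypothetical. The sketch turns these local probabilities into a cumulative rate in the same way a tail-area FDR is obtained from a local false discovery rate: sort by the local value and take the running mean.

```python
import numpy as np


def idr_from_local(local_idr):
    """Running mean of sorted local irreproducibility probabilities.

    For each observation, returns the average local value over all
    observations that are at least as reproducible as it (i.e. have a
    local value no larger than its own, ties broken by input order).
    """
    local_idr = np.asarray(local_idr, dtype=float)
    # Most reproducible observations (smallest local value) first.
    order = np.argsort(local_idr, kind="stable")
    running_mean = np.cumsum(local_idr[order]) / np.arange(1, local_idr.size + 1)
    # Put the cumulative rates back into the original input order.
    idr = np.empty_like(local_idr)
    idr[order] = running_mean
    return idr


# Hypothetical posterior probabilities for four paired observations.
local = np.array([0.01, 0.40, 0.05, 0.90])
idr = idr_from_local(local)

# Keep every pair whose cumulative rate stays below an illustrative 0.2 cutoff.
selected = idr <= 0.2
```

Thresholding the cumulative rate rather than the local value means that a pair with a moderate local probability can still be kept when the pairs ranked ahead of it are highly reproducible, which mirrors how an FDR cutoff behaves relative to a local fdr cutoff.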

62H99 Multivariate analysis
62-09 Graphical methods in statistics (MSC2010)
65C60 Computational problems in statistics (MSC2010)
62P10 Applications of statistics to biology and medical sciences; meta analysis
Software: F-Seq; idr
[1] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc. Ser. B 57 289-300. · Zbl 0809.62014
[2] Blest, D. C. (2000). Rank correlation-an alternative measure. Aust. N. Z. J. Stat. 42 101-111. · Zbl 0977.62061 · doi:10.1111/1467-842X.00110
[3] Boulesteix, A. L. and Slawski, M. (2009). Stability and aggregation of ranked gene lists. Briefings in Bioinformatics 10 556-568.
[4] Boyle, A. P., Guinney, J., Crawford, G. E. and Furey, T. S. (2008). F-Seq: A feature density estimator for high-throughput sequence tags. Bioinformatics 24 2537-2538.
[5] da Costa, J. P. and Soares, C. (2005). A weighted rank measure of correlation. Aust. N. Z. J. Stat. 47 515-529. · Zbl 1127.62052 · doi:10.1111/j.1467-842X.2005.00413.x
[6] Deheuvels, P. (1979). La fonction de dépendance empirique et ses propriétés. Un test non paramétrique d’indépendance. Acad. Roy. Belg. Bull. Cl. Sci. (5) 65 274-292. · Zbl 0422.62037
[7] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39 1-38. · Zbl 0364.62022
[8] Efron, B. (2004). Local false discovery rate. Technical report, Dept. Statistics, Stanford Univ.
[9] ENCODE Project Consortium (2004). The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306 636-640.
[10] Fisher, R. A. (1925). Statistical Methods for Research Workers, 1st ed. Oliver & Boyd, Edinburgh. · JFM 51.0414.08
[11] Fisher, N. I. and Switzer, P. (1985). Chi-plots for assessing dependence. Biometrika 72 253-265. · Zbl 0572.62047 · doi:10.1093/biomet/72.2.253
[12] Fisher, N. I. and Switzer, P. (2001). Graphical assessment of dependence: Is a picture worth 100 tests? Amer. Statist. 55 233-239. · Zbl 05680454 · doi:10.1198/000313001317098248
[13] Genest, C. and Boies, J.-C. (2003). Detecting dependence with Kendall plots. Amer. Statist. 57 275-284. · Zbl 1182.62005 · doi:10.1198/0003130032431
[14] Genest, C., Ghoudi, K. and Rivest, L. P. (1995). A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika 82 543-552. · Zbl 0831.62030 · doi:10.1093/biomet/82.3.543
[15] Genest, C. and Plante, J.-F. (2003). On Blest’s measure of rank correlation. Canad. J. Statist. 31 35-52. · Zbl 1035.62058 · doi:10.2307/3315902
[16] Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 499-517. · Zbl 1090.62072 · doi:10.1111/1467-9868.00347
[17] Hu, L. (2006). Dependence patterns across financial markets: A mixed copula approach. Applied Financial Economics 16 717-729.
[18] Ji, H., Jiang, H., Ma, W., Johnson, D. S., Myers, R. M. and Wong, W. H. (2008). An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nature Biotechnology 26 1293-1300.
[19] Joe, H. (1997). Multivariate Models and Dependence Concepts. Monogr. Statist. Appl. Probab. 73. Chapman & Hall, London. · Zbl 0990.62517
[20] Jothi, R., Cuddapah, S., Barski, A., Cui, K. and Zhao, K. (2008). Genome-wide identification of in vivo protein-DNA binding sites from ChIP-seq data. Nucleic Acids Res. 36 5221-5231.
[21] Kallenberg, W. C. M. and Ledwina, T. (1999). Data-driven rank tests for independence. J. Amer. Statist. Assoc. 94 285-301. · Zbl 1072.62574 · doi:10.2307/2669703
[22] Kharchenko, P. V., Tolstorukov, M. Y. and Park, P. J. (2008). Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nature Biotechnology 26 1351-1359.
[23] Kheradpour, P., Stark, A., Roy, S. and Kellis, M. (2007). Reliable prediction of regulator targets using 12 Drosophila genomes. Genome Res. 17 1919-1931.
[24] Kuo, W., Liu, F., Trimarchi, J., Punzo, C., Lombardi, M., Sarang, J., Whipple, M. E. et al. (2006). A sequence-oriented comparison of gene expression measurements across different hybridization-based technologies. Nature Biotechnology 24 832-840.
[25] Lehmann, E. L. (2006). Nonparametrics: Statistical Methods Based on Ranks, 2nd ed. Springer, New York. · Zbl 1217.62061
[26] Li, Q., Brown, J. B., Huang, H. and Bickel, P. J. (2011). Supplement to “Measuring reproducibility of high-throughput experiments.” · Zbl 1231.62124 · doi:10.1214/11-AOAS466
[27] MAQC consortium (2006). The microarray quality control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nature Biotechnology 24 1151-1161.
[28] McLachlan, G. J. (1987). On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Applied Statistics 36 318-324.
[29] Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods 5 621-628.
[30] Nelsen, R. B. (1999). An Introduction to Copulas, 2nd ed. Springer, New York.
[31] Oakes, D. (1994). Multivariate survival distributions. J. Nonparametr. Stat. 3 343-354. · Zbl 1378.62121 · doi:10.1080/10485259408832593
[32] Park, P. J. (2009). ChIP-seq: Advantages and challenges of a maturing technology. Nat. Rev. Genet. 10 669-680.
[33] Rozowsky, J., Euskirchen, G., Auerbach, R. K., Zhang, Z. D., Gibson, T., Bjornson, R., Carriero, N., Snyder, M. and Gerstein, M. B. (2009). PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature Biotechnology 27 66-75.
[34] Sklar, M. (1959). Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 8 229-231. · Zbl 0100.14202
[35] Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 479-498. · Zbl 1090.62073 · doi:10.1111/1467-9868.00346
[36] Storey, J. D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. Ann. Statist. 31 2013-2035. · Zbl 1042.62026 · doi:10.1214/aos/1074290335
[37] Stouffer, S. A., Suchman, E. A., DeVinney, L. C., Star, S. A. and Williams, J. (1949). The American Soldier: Vol. 1. Adjustment During Army Life. Princeton Univ. Press, Princeton, NJ.
[38] Sun, W. and Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc. 102 901-912. · Zbl 05564419 · doi:10.1198/016214507000000545
[39] Thurman, R., Hawrylycz, M., Kuehn, S., Haugen, E. and Stamatoyannopoulos, S. (2011). Hotspot: A scan statistic for identifying enriched regions of short-read sequence tags. Unpublished manuscript, Univ. Washington.
[40] Valouev, A., Johnson, D. S., Sundquist, A., Medina, C., Anton, E., Batzoglou, S., Myers, R. M. and Sidow, A. (2008). Genome-wide analysis of transcription factor binding sites based on ChIP-seq data. Nature Methods 5 829-834.
[41] Zhang, Y., Liu, T., Meyer, C. A., Eeckhoute, J., Johnson, D. S., Bernstein, B. E., Nussbaum, C., Myers, R. M., Brown, M., Li, W. and Liu, X. S. (2008). Model-based analysis of ChIP-seq (MACS). Genome Biology 9 R137.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.