
Covariate-assisted ranking and screening for large-scale two-sample inference. (English) Zbl 1420.62032

Summary: Two-sample multiple testing has a wide range of applications. The conventional practice first reduces the original observations to a vector of \(p\)-values and then chooses a cut-off to adjust for multiplicity. However, this data reduction step could cause significant loss of information and thus lead to suboptimal testing procedures. We introduce a new framework for two-sample multiple testing by incorporating a carefully constructed auxiliary variable in inference to improve the power. A data-driven multiple-testing procedure is developed by employing a covariate-assisted ranking and screening (CARS) approach that optimally combines the information from both the primary and the auxiliary variables. The proposed CARS procedure is shown to be asymptotically valid and optimal for false discovery rate control. The procedure is implemented in the R package CARS. Numerical results confirm the effectiveness of CARS in false discovery rate control and show that it achieves substantial power gain over existing methods. CARS is also illustrated through an application to the analysis of a satellite imaging data set for supernova detection.
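For orientation, the conventional practice that the summary contrasts with CARS (reduce the two samples to a vector of \(p\)-values, then choose a cut-off) can be sketched as follows. This is a minimal illustration, not the CARS procedure and not the interface of the R package CARS: the function names are hypothetical, unit variances are assumed for the z-statistics, and the cut-off is chosen with the Benjamini-Hochberg step-up rule.

```python
import math


def two_sample_z_pvalues(x, y):
    """Two-sided p-values from per-coordinate two-sample z-statistics.

    x and y are lists of observations; each observation is a list of m
    coordinates. Unit variance is assumed (an illustrative simplification).
    """
    n_x, n_y, m = len(x), len(y), len(x[0])
    scale = math.sqrt(1.0 / n_x + 1.0 / n_y)
    pvals = []
    for j in range(m):
        mean_x = sum(row[j] for row in x) / n_x
        mean_y = sum(row[j] for row in y) / n_y
        z = (mean_x - mean_y) / scale
        # For standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
        pvals.append(math.erfc(abs(z) / math.sqrt(2.0)))
    return pvals


def benjamini_hochberg(pvals, alpha=0.05):
    """Return the sorted indices rejected by the BH step-up rule at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose ordered p-value falls below its threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])


# Example: with four hypotheses, only the two smallest p-values are rejected.
rejected = benjamini_hochberg([0.001, 0.8, 0.02, 0.5], alpha=0.05)
print(rejected)  # [0, 2]
```

Only the \(p\)-values enter the second step here, so any information in the original observations that they do not capture is discarded; this is the loss that the summary says an auxiliary variable is meant to recover.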

MSC:

62C25 Compound decision problems in statistical decision theory
62P35 Applications of statistics to physics
62H35 Image analysis in multivariate analysis
85A15 Galactic and stellar structure

Software:

KernSmooth; R; CARS
