A multiple testing protocol for exploratory data analysis and the local misclassification rate. (English) Zbl 1508.62184

Summary: False discovery rate (FDR) procedures are often employed in exploratory data analysis to determine which among thousands or millions of attributes are worthy of follow-up analysis. However, these procedures tend to discover the most statistically significant attributes, which need not be the most worthy of further exploration. This article provides a new FDR-controlling method that allows the nature of the exploratory analysis to be considered when determining which attributes are discovered. To illustrate, a study in which the objective is to classify discoveries into one of several clusters is considered, and a new FDR method that minimizes the misclassification rate is developed. It is shown analytically and by simulation that the proposed method performs better than competing methods.
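
For orientation only, the kind of conventional FDR procedure the summary contrasts against can be sketched with the classical Benjamini-Hochberg step-up rule. This is not the method proposed in the article; the function name `benjamini_hochberg`, the level `alpha`, and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices rejected by the classical BH step-up rule.

    Illustrative baseline only; NOT the article's proposed procedure.
    """
    m = len(p_values)
    # Order hypothesis indices from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    return sorted(order[:k_max])


# Made-up p-values for ten attributes (hypothetical data).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
discoveries = benjamini_hochberg(p_values, alpha=0.05)
print(discoveries)
```

The rule ranks attributes purely by p-value, which is why, as the summary notes, the discoveries are the most statistically significant attributes rather than necessarily those most useful for a downstream task such as clustering.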

MSC:

62J15 Paired and multiple comparisons; multiple testing
62C25 Compound decision problems in statistical decision theory
62F03 Parametric hypothesis testing
62P10 Applications of statistics to biology and medical sciences; meta analysis

Software:

R