
A decision-theory approach to interpretable set analysis for high-dimensional data. (English) Zbl 1429.62496

Summary: A key problem in high-dimensional significance analysis is to find pre-defined sets that show enrichment for a statistical signal of interest; the classic example is the enrichment of gene sets for differentially expressed genes. Here, we propose a new decision-theory approach to the analysis of gene sets which focuses on estimating the fraction of non-null variables in a set. We introduce the idea of “atoms”, non-overlapping sets based on the original pre-defined set annotations. Our approach focuses on finding the union of atoms that minimizes a weighted average of the number of false discoveries and missed discoveries. We introduce a new false discovery rate for sets, called the atomic false discovery rate (afdr), and prove that the optimal estimator in our decision-theory framework is to threshold the afdr. These results provide a coherent and interpretable framework for the analysis of sets that addresses the key issues of overlapping annotations and difficulty in interpreting \(p\) values in both competitive and self-contained tests. We illustrate our method and compare it to a popular existing method using simulated examples, as well as gene-set and brain ROI data analyses.
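The summary does not spell out how atoms or the afdr are computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that (a) an atom is the group of variables sharing exactly the same set memberships, (b) the afdr of an atom is the average posterior null probability of its variables, and (c) the loss is \(w\) times the number of false discoveries plus \((1-w)\) times the number of missed discoveries. Under assumption (c), including an atom lowers the posterior expected loss exactly when its afdr is at most \(1-w\), which is the thresholding rule used below. The function names, the weight `w`, and the example probabilities are all hypothetical.

```python
from collections import defaultdict


def build_atoms(set_annotations):
    """Partition variables into non-overlapping atoms.

    Assumption: an atom is the group of variables that belong to exactly
    the same collection of pre-defined sets (same membership signature).
    """
    membership = defaultdict(set)
    for set_name, members in set_annotations.items():
        for variable in members:
            membership[variable].add(set_name)

    atoms = defaultdict(list)
    for variable, sets in membership.items():
        atoms[frozenset(sets)].append(variable)
    return {signature: sorted(vs) for signature, vs in atoms.items()}


def atomic_fdr(atoms, null_prob):
    """Assumed afdr: mean posterior probability of being null within an atom."""
    return {
        signature: sum(null_prob[v] for v in vs) / len(vs)
        for signature, vs in atoms.items()
    }


def select_atoms(afdr, w):
    """Keep atoms whose afdr is at most 1 - w.

    Under the assumed loss w * (false discoveries) + (1 - w) * (missed
    discoveries), an atom of size n adds w * n * afdr expected loss if
    included and (1 - w) * n * (1 - afdr) if excluded, so inclusion is
    preferred when afdr <= 1 - w.
    """
    return {signature for signature, value in afdr.items() if value <= 1 - w}


# Hypothetical overlapping annotations and posterior null probabilities.
sets = {"S1": ["g1", "g2", "g3"], "S2": ["g3", "g4"]}
null_prob = {"g1": 0.1, "g2": 0.3, "g3": 0.05, "g4": 0.9}

atoms = build_atoms(sets)
afdr = atomic_fdr(atoms, null_prob)
selected = select_atoms(afdr, w=0.5)
```

In this toy input the overlap between S1 and S2 (variable g3) becomes its own atom, so each variable is counted once even though the original annotations overlap. The posterior null probabilities would in practice come from an upstream model such as a local false discovery rate estimate; that step is outside this sketch.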

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62J15 Paired and multiple comparisons; multiple testing
62C12 Empirical decision procedures; empirical Bayes procedures
