A multi-functional analyzer uses parameter constraints to improve the efficiency of model-based gene-set analysis. (English) Zbl 1454.62417

Summary: We develop a model-based methodology for integrating gene-set information with an experimentally derived gene list. The methodology uses a previously reported sampling model, but takes advantage of natural constraints in the high-dimensional discrete parameter space to work from a more structured prior distribution than is currently available. We show how the natural constraints are expressed in terms of linear inequality constraints within a set of binary latent variables. Further, the currently available prior gives low probability to these constraints in complex systems, such as Gene Ontology (GO), thus reducing the efficiency of statistical inference. We develop two computational advances to enable posterior inference within the constrained parameter space: one using integer linear programming for optimization and one using a penalized Markov chain sampler. Numerical experiments demonstrate the utility of the new methodology for a multivariate integration of genomic data with GO or related information systems. Compared to available methods, the proposed multi-functional analyzer covers more reported genes without mis-covering nonreported genes, as demonstrated on genome-wide data from association studies of type 2 diabetes and from RNA interference studies of influenza.
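
To make the idea of linear inequality constraints on binary latent variables concrete, here is a minimal sketch. It is not the method described in the summary. The constraint form (an active set forces every set nested inside it to be active, written as z[v] - z[w] <= 0), the toy gene sets, the reported list, and the scoring weights are all assumptions made for illustration. Exhaustive enumeration stands in for an integer linear programming solver, which is feasible only because the example is tiny.

```python
from itertools import product

# Hypothetical toy data: set names, memberships, and the reported list are invented.
GENE_SETS = {
    "set_a": {"g1", "g2", "g3", "g4"},
    "set_b": {"g1", "g2"},  # nested inside set_a
    "set_c": {"g5", "g6"},
}
REPORTED = {"g1", "g2", "g5", "g6"}


def nesting_constraints(gene_sets):
    """Return pairs (v, w) with w a proper subset of v.

    Each pair encodes the assumed linear inequality z[v] - z[w] <= 0.
    """
    return [
        (v, w)
        for v in gene_sets
        for w in gene_sets
        if v != w and gene_sets[w] < gene_sets[v]
    ]


def is_feasible(z, constraints):
    """Check every inequality z[v] - z[w] <= 0 for a 0/1 assignment z."""
    return all(z[v] - z[w] <= 0 for v, w in constraints)


def coverage_score(z, gene_sets, reported):
    """Assumed toy objective: reward covered reported genes, penalize
    covered nonreported genes, and lightly penalize each active set."""
    active = [gene_sets[name] for name in gene_sets if z[name]]
    covered = set().union(*active)
    hits = len(covered & reported)
    misses = len(covered - reported)
    return 10 * hits - 10 * misses - len(active)


def best_feasible_assignment(gene_sets, reported):
    """Enumerate all 0/1 assignments and keep the best feasible one.

    This brute-force search is a stand-in for an ILP solver.
    """
    names = sorted(gene_sets)
    constraints = nesting_constraints(gene_sets)
    best_score, best_z = None, None
    for bits in product((0, 1), repeat=len(names)):
        z = dict(zip(names, bits))
        if not is_feasible(z, constraints):
            continue
        s = coverage_score(z, gene_sets, reported)
        if best_score is None or s > best_score:
            best_score, best_z = s, z
    return best_score, best_z


if __name__ == "__main__":
    print(best_feasible_assignment(GENE_SETS, REPORTED))
```

In this toy setting, activating set_a alone is infeasible because its nested set_b would be inactive, so the search considers only assignments that respect the inequality. With realistic numbers of gene sets the enumeration is intractable, which is where a dedicated integer programming solver or a sampler would be needed.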


62P10 Applications of statistics to biology and medical sciences; meta analysis
