Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. (English) Zbl 1228.62150

Summary: The vast amount of biological knowledge accumulated over the years has allowed researchers to identify various biochemical interactions and define different families of pathways. There is an increased interest in identifying pathways and pathway elements involved in particular biological processes. Drug discovery efforts, for example, are focused on identifying biomarkers as well as pathways related to a disease. We propose a Bayesian model that addresses this question by incorporating information on pathways and gene networks in the analysis of DNA microarray data. Such information is used to define pathway summaries, specify prior distributions, and structure the MCMC moves to fit the model. We illustrate the method with an application to gene expression data with censored survival outcomes. In addition to identifying markers that would have been missed otherwise and improving prediction accuracy, the integration of existing biological knowledge into the analysis provides a better understanding of underlying molecular processes.


62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology
62F15 Bayesian inference
65C40 Numerical analysis or methods applied to Markov chains
62N01 Censored data models


[1] Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. J. Amer. Statist. Assoc. 88 669-679. · Zbl 0774.62031
[2] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. and Sherlock, G. (2000). Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25 25-29.
[3] Bair, E., Hastie, T., Paul, D. and Tibshirani, R. (2006). Prediction by supervised principal components. J. Amer. Statist. Assoc. 101 119-137. · Zbl 1118.62326
[4] Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. Ser. B 36 192-236. · Zbl 0327.60067
[5] Bild, A. H., Yao, G., Chang, J. T., Wang, Q., Potti, A., Chasse, D., Joshi, M.-B., Harpole, D., Lancaster, J. M., Berchuck, A., Olson, J. A. Jr., Marks, J. R., Dressman, H. K., West, M. and Nevins, J. R. (2006). Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 439 353-357.
[6] Boulesteix, A.-L. and Strimmer, K. (2007). Partial least squares: A versatile tool for the analysis of high-dimensional genomic data. Brief. Bioinformatics 8 32-44.
[7] Brown, P. J., Vannucci, M. and Fearn, T. (1998). Multivariate Bayesian variable selection and prediction. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 627-641. · Zbl 0909.62022
[8] Chipman, H., George, E. I. and McCulloch, R. E. (2001). The practical implementation of Bayesian model selection. In Model Selection. Institute of Mathematical Statistics Lecture Notes-Monograph Series 38 65-134. IMS, Beachwood, OH.
[9] Dahlquist, K. D., Salomonis, N., Vranizan, K., Lawlor, S. C. and Conklin, B. R. (2002). GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat. Genet. 31 19-20.
[10] Denkert, C., Winzer, K.-J. and Hauptmann, S. (2004). Prognostic impact of cyclooxygenase-2 in breast cancer. Clin. Breast Cancer 4 428-433.
[11] Doniger, S., Salomonis, N., Dahlquist, K., Vranizan, K., Lawlor, S. and Conklin, B. (2003). MAPPFinder: Using Gene Ontology and GenMAPP to create a global gene-expression profile for microarray data. Genome Biology 4 R7.
[12] Downward, J. (2006). Cancer biology: Signatures guide drug choice. Nature 439 274-275.
[13] Frankel, L. B., Lykkesfeldt, A. E., Hansen, J. B. and Stenvang, J. (2007). Protein Kinase C alpha is a marker for antiestrogen resistance and is involved in the growth of tamoxifen resistant human breast cancer cells. Breast Cancer Res. Treat. 104 165-179.
[14] Friedman, J., Hastie, T. and Tibshirani, R. (2010). A note on the group lasso and a sparse group lasso. Technical report, Dept. Stat., Stanford Univ.
[15] George, E. I. and McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statist. Sinica 7 339-373. · Zbl 0884.62031
[16] Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286 531-537. · Zbl 1047.65504
[17] Guan, Y. and Stephens, M. (2011). Bayesian variable selection regression for genome-wide association studies, and other large-scale problems. Ann. Appl. Stat. · Zbl 1229.62145
[18] Guo, W., Pylayeva, Y., Pepe, A., Yoshioka, T., Muller, W. J., Inghirami, G. and Giancotti, F. G. (2006). Beta 4 integrin amplifies ErbB2 signaling to promote mammary tumorigenesis. Cell 126 489-502.
[19] Gupta, G. P., Nguyen, D. X., Chiang, A. C., Bos, P. D., Kim, J. Y., Nadal, C., Gomis, R. R., Manova-Todorova, K. and Massagué, J. (2007). Mediators of vascular remodelling co-opted for sequential steps in lung metastasis. Nature 446 765-770.
[20] Joshi-Tope, G., Gillespie, M., Vastrik, I., D’Eustachio, P., Schmidt, E., de Bono, B., Jassal, B., Gopinath, G. R., Wu, G. R., Matthews, L., Lewis, S., Birney, E. and Stein, L. (2005). Reactome: A knowledgebase of biological pathways. Nucleic Acids Res. 33 D428-D432.
[21] Kanehisa, M. and Goto, S. (2000). Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28 27-30.
[22] Krieger, C., Zhang, P., Mueller, L., Wang, A., Paley, S., Arnaud, M., Pick, J., Rhee, S. and Karp, P. (2004). MetaCyc: A multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res. 32 D438-D442.
[23] Kwon, D., Tadesse, M. G., Sha, N., Pfeiffer, R. M. and Vannucci, M. (2007). Identifying biomarkers from mass spectrometry data with ordinal outcome. Cancer Inform. 3 19-28.
[24] Kyung, M., Gill, J., Ghosh, M. and Casella, G. (2010). Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 5 369-412. · Zbl 1330.62289
[25] Landemaine, T., Jackson, A., Bellahcène, A., Rucci, N., Sin, S., Abad, B. M., Sierra, A., Boudinet, A., Guinebretière, J.-M., Ricevuto, E., Noguès, C., Briffod, M., Bièche, I., Cherel, P., Garcia, T., Castronovo, V., Teti, A., Lidereau, R. and Driouch, K. (2008). A six-gene signature predicting breast cancer lung metastasis. Cancer Res. 68 6092-6099.
[26] Lee, S., Jeong, Y., Im, H. G., Kim, C., Chang, Y. and Lee, I. (2007). Silibinin suppresses PMA-induced MMP-9 expression by blocking the AP-1 activation via MAPK signaling pathways in MCF-7 human breast carcinoma cells. Biochemical and Biophysical Research Communications 354 165-171.
[27] Li, C. and Li, H. (2008). Network-constrained regularization and variable selection for analysis of genomics data. Bioinformatics 24 1175-1182. · Zbl 1022.68519
[28] Li, F. and Zhang, N. (2010). Bayesian variable selection in structured high-dimensional covariate space with application in genomics. J. Amer. Statist. Assoc. 105 1202-1214. · Zbl 1390.62027
[29] Lindgren, F., Geladi, P. and Wold, S. (1993). The kernel algorithm of PLS. Journal of Chemometrics 7 45-59.
[30] Lønne, G. K., Cornmark, L., Zahirovic, I. O., Landberg, G., Jirström, K. and Larsson, C. (2010). PKCalpha expression is a marker for breast cancer aggressiveness. Mol. Cancer 9 76.
[31] Lucas, J., Carvalho, C., Wang, Q., Bild, A., Nevins, J. and West, M. (2006). Sparse statistical modelling in gene expression genomics. In Bayesian Inference for Gene Expression and Proteomics (K. Do, P. Mueller and M. Vannucci, eds.) 155-176. Cambridge Univ. Press, Cambridge.
[32] Møller, J., Pettitt, A. N., Reeves, R. and Berthelsen, K. K. (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93 451-458. · Zbl 1158.62020
[33] Nakao, M., Bono, H., Kawashima, S., Kamiya, T., Sato, K., Goto, S. and Kanehisa, M. (1999). Genome-scale gene expression analysis and pathway reconstruction in KEGG. Genome Informatics Series: Workshop on Genome Informatics 10 94-103.
[34] Pan, W., Xie, B. and Shen, X. (2010). Incorporating predictor network in penalized regression with application to microarray data. Biometrics 66 474-484. · Zbl 1192.62235
[35] Park, M. Y., Hastie, T. and Tibshirani, R. (2007). Averaged gene expressions for regression. Biostatistics 8 212-227. · Zbl 1144.62357
[36] Pittman, J., Huang, E., Dressman, H., Horng, C., Cheng, S., Tsou, M., Chen, C., Bild, A., Iversen, E., Huang, A., Nevins, J. and West, M. (2004). Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proc. Natl. Acad. Sci. USA 101 8431-8436.
[37] Propp, J. G. and Wilson, D. B. (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. In Proceedings of the Seventh International Conference on Random Structures and Algorithms (Atlanta, GA, 1995) 9 223-252. · Zbl 0859.60067
[38] Sha, N., Tadesse, M. G. and Vannucci, M. (2006). Bayesian variable selection for the analysis of microarray data with censored outcomes. Bioinformatics 22 2262-2268.
[39] Sha, N., Vannucci, M., Tadesse, M. G., Brown, P. J., Dragoni, I., Davies, N., Roberts, T. C., Contestabile, A., Salmon, M., Buckley, C. and Falciani, F. (2004). Bayesian variable selection in multinomial probit models to identify molecular signatures of disease stage. Biometrics 60 812-828. · Zbl 1274.62428
[40] Shipp, M. A., Ross, K. N., Tamayo, P., Weng, A. P., Kutok, J. L., Aguiar, R. C. T., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G. S., Ray, T. S., Koval, M. A., Last, K. W., Norton, A., Lister, T. A., Mesirov, J., Neuberg, D. S., Lander, E. S., Aster, J. C. and Golub, T. R. (2002). Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med. 8 68-74.
[41] Stingo, F. and Vannucci, M. (2011). Variable selection for discriminant analysis with Markov random field priors for the analysis of microarray data. Bioinformatics 27 495-501.
[42] Stingo, F., Chen, Y., Tadesse, M. and Vannucci, M. (2011). Supplement to: “Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes.” · Zbl 1228.62150
[43] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S. and Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 102 15545-15550.
[44] Telesca, D., Muller, P., Parmigiani, G. and Freedman, R. (2008). Modeling dependent gene expression. Technical report, Dept. of Biostatistics, Univ. Texas M.D. Anderson Cancer Center. · Zbl 1243.62038
[45] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267-288. · Zbl 0850.62538
[46] Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R. B. (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17 520-525.
[47] van’t Veer, L., Dai, H., van de Vijver, M., He, Y., Hart, A., Mao, M., Peterse, H., van der Kooy, K., Marton, M., Witteveen, A., Schreiber, G., Kerkhoven, R., Roberts, C., Linsley, P., Bernards, R. and Friend, S. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415 530-536.
[48] Wei, L. J. (1992). The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Stat. Med. 11 1871-1879.
[49] Wei, Z. and Li, H. (2007). A Markov random field model for network-based analysis of genomic data. Bioinformatics 23 1537-1544.
[50] Wei, Z. and Li, H. (2008). A hidden spatial-temporal Markov random field model for network-based analysis of time course gene expression data. Ann. Appl. Stat. 2 408-429. · Zbl 1137.62081
[51] Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In Multivariate Analysis (Proc. Internat. Sympos., Dayton, Ohio, 1965) (P. Krishnaiah, ed.) 391-420. Academic Press, New York. · Zbl 0214.46103
[52] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49-67. · Zbl 1141.62030
[53] Zhang, J. D. and Wiemann, S. (2009). KEGGgraph: A graph approach to KEGG PATHWAY in R and Bioconductor. Bioinformatics 25 1470-1471.
[54] Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 301-320. · Zbl 1069.62054
[55] Zou, H., Hastie, T. and Tibshirani, R. (2006). Sparse principal component analysis. J. Comput. Graph. Statist. 15 265-286.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.