Inferring gene-gene interactions and functional modules using sparse canonical correlation analysis. (English) Zbl 1454.62416

Summary: Networks pervade many disciplines of science for analyzing complex systems with interacting components. In particular, this concept is commonly used to model interactions between genes and identify closely associated genes forming functional modules. In this paper, we focus on gene group interactions and infer these interactions using appropriate partial correlations between genes, that is, the conditional dependencies between genes after removing the influences of a set of other functionally related genes. We introduce a new method for estimating group interactions using sparse canonical correlation analysis (SCCA) coupled with repeated random partition and subsampling of the gene expression data set. By considering different subsets of genes and ways of grouping them, our interaction measure can be viewed as an aggregated estimate of partial correlations of different orders. Our approach is unique in evaluating conditional dependencies when the correct dependent sets are unknown or only partially known. As a result, a gene network can be constructed using the interaction measures as edge weights and gene functional groups can be inferred as tightly connected communities from the network. Comparisons with several popular approaches using simulated and real data show our procedure improves both the statistical significance and biological interpretability of the results. In addition to achieving considerably lower false positive rates, our procedure shows better performance in detecting important biological pathways.
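The aggregation idea in the summary can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a rank-1 sparse CCA fitted by alternating soft-thresholded power iterations, an even random split of the genes into two groups, and an interaction score equal to the average of |u_i v_j| over the rounds in which genes i and j fall in different groups. The function names and the parameters `lam`, `subsample` and `n_rounds` are hypothetical choices made for illustration.

```python
import numpy as np


def soft_threshold(x, lam):
    """Shrink each entry of x toward zero by lam (lasso-type penalty)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)


def sparse_cca_rank1(X, Y, lam=0.1, n_iter=50):
    """Rank-1 sparse CCA sketch: alternate soft-thresholded power
    iterations on the sample cross-correlation matrix of X and Y.
    Returns sparse weight vectors (u, v); both are zero if the
    penalty removes every coefficient."""
    X = (X - X.mean(0)) / (X.std(0) + 1e-12)
    Y = (Y - Y.mean(0)) / (Y.std(0) + 1e-12)
    C = X.T @ Y / X.shape[0]                      # p x q cross-correlations
    v = np.linalg.svd(C, full_matrices=False)[2][0]  # leading right singular vector
    u = np.zeros(C.shape[0])
    for _ in range(n_iter):
        u = soft_threshold(C @ v, lam)
        if not u.any():
            return np.zeros(C.shape[0]), np.zeros(C.shape[1])
        u /= np.linalg.norm(u)
        v = soft_threshold(C.T @ u, lam)
        if not v.any():
            return np.zeros(C.shape[0]), np.zeros(C.shape[1])
        v /= np.linalg.norm(v)
    return u, v


def interaction_scores(expr, n_rounds=200, subsample=0.8, lam=0.1, seed=0):
    """Aggregate sparse CCA weights over repeated random gene partitions
    and sample subsamples. expr is a (samples x genes) array with at
    least 3 samples. Returns a symmetric (genes x genes) score matrix
    with a zero diagonal."""
    rng = np.random.default_rng(seed)
    n, p = expr.shape
    total = np.zeros((p, p))
    counts = np.zeros((p, p))
    for _ in range(n_rounds):
        # Subsample the experiments (rows) without replacement.
        rows = rng.choice(n, size=max(3, int(subsample * n)), replace=False)
        # Randomly partition the genes (columns) into two groups.
        perm = rng.permutation(p)
        a, b = perm[: p // 2], perm[p // 2:]
        u, v = sparse_cca_rank1(expr[np.ix_(rows, a)], expr[np.ix_(rows, b)], lam)
        w = np.abs(np.outer(u, v))                # pairwise weight products
        total[np.ix_(a, b)] += w
        total[np.ix_(b, a)] += w.T
        counts[np.ix_(a, b)] += 1
        counts[np.ix_(b, a)] += 1
    # Average only over rounds in which the two genes were separated.
    return np.divide(total, counts, out=np.zeros_like(total), where=counts > 0)
```

Calling `interaction_scores(expr)` on a samples-by-genes matrix returns a symmetric score matrix. Under these assumptions, its entries could serve as edge weights for a gene network, to which a community detection method of the reader's choice could then be applied to propose functional groups.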


62P10 Applications of statistics to biology and medical sciences; meta analysis
62H20 Measures of association (correlation, canonical correlation, etc.)
92D10 Genetics and epigenetics

