Decomposition and model selection for large contingency tables.

*(English)* Zbl 1207.62126

Summary: Large contingency tables summarizing categorical variables arise in many areas. One example is in biology, where large numbers of biomarkers are cross-tabulated according to their discrete expression level. Interactions of the variables are of great interest and are generally studied with log-linear models. The structure of a log-linear model can be visually represented by a graph from which the conditional independence structure can then be easily read off. However, since the number of parameters in a saturated model grows exponentially in the number of variables, this generally comes with a heavy computational burden. Even if we restrict ourselves to models of lower-order interactions or other sparse structures, we are faced with the problem of a large number of cells which play the role of sample size. This is in sharp contrast to high-dimensional regression or classification procedures because, in addition to a high-dimensional parameter, we also have to deal with the analogue of a huge sample size. Furthermore, high-dimensional tables naturally feature a large number of sampling zeros which often leads to the nonexistence of the maximum likelihood estimate. We therefore present a decomposition approach, where we first divide the problem into several lower-dimensional problems and then combine these to form a global solution. Our methodology is computationally feasible for log-linear interaction models with many categorical variables each or some of them having many levels. We demonstrate the proposed method on simulated data and apply it to a bio-medical problem in cancer research.
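The growth described in the summary can be illustrated with a short count. The following is only a sketch and is not taken from the paper: the helper names `n_cells` and `n_params_pairwise` are hypothetical. It assumes the usual Poisson formulation, in which the saturated log-linear model has one parameter per cell, and it compares that with a model restricted to main effects and two-way interactions.

```python
from itertools import combinations
from math import prod


def n_cells(levels):
    """Number of cells in the table, which is also the number of
    parameters of the saturated log-linear model (Poisson form)."""
    return prod(levels)


def n_params_pairwise(levels):
    """Parameters of a log-linear model with an intercept, all main
    effects, and all two-way interactions (hypothetical helper)."""
    main = sum(l - 1 for l in levels)
    pair = sum((a - 1) * (b - 1) for a, b in combinations(levels, 2))
    return 1 + main + pair


if __name__ == "__main__":
    # d variables with three levels each: the cell count grows as 3**d,
    # while the pairwise model grows only quadratically in d.
    for d in (5, 10, 20):
        levels = [3] * d
        print(d, n_cells(levels), n_params_pairwise(levels))
```

Even when the parameter count stays moderate, the number of cells still grows exponentially, which is the "huge sample size" analogue the summary refers to.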

##### MSC:

62H17 | Contingency tables |

62J12 | Generalized linear models (logistic models) |

62P10 | Applications of statistics to biology and medical sciences; meta analysis |

05C90 | Applications of graph theory |

65C60 | Computational problems in statistics (MSC2010) |

62-04 | Software, source code, etc. for problems pertaining to statistics |

##### Keywords:

categorical data; graphical model; group lasso; log-linear models; sparse contingency tables

##### Software:

MICE

##### References:

[1] | Bishop, Discrete Multivariate Analysis (1975) |

[2] | Breiman, Random forests, Machine Learning 45 pp 5– (2001) · Zbl 1007.68152 |

[3] | Christensen, Linear Models for Multivariate, Time Series, and Spatial Data (1991) · Zbl 0717.62079 |

[4] | Dahinden, Mining tissue microarray data to uncover combinations of biomarker expression patterns that improve intermediate staging and grading of clear cell renal cell cancer, Clinical Cancer Research 16 pp 88– (2010) |

[5] | Dahinden, Penalized likelihood for sparse contingency tables with an application to full-length cDNA libraries, BMC Bioinformatics 8 pp 476– (2007) |

[6] | Darroch, Markov fields and log-linear interaction models for contingency tables, Annals of Statistics 8 pp 522– (1980) · Zbl 0444.62064 |

[7] | Imai, Hypoxia attenuates the expression of E-cadherin via up-regulation of SNAIL in ovarian carcinoma cells, The American Journal of Pathology 163 pp 1437– (2003) |

[8] | Jackson, L., Gray, A. and Fienberg, S. (2007). Sequential category aggregation and partitioning approach for multi-way contingency tables based on survey and census data, preprint. · Zbl 1149.62049 |

[9] | Kallioniemi, Tissue microarray technology for high-throughput molecular profiling of cancer, Human Molecular Genetics 10 pp 657– (2001) |

[10] | Kim, S. (2005). Log-linear modelling for contingency tables by using marginal model structures. Research Report 05, Division of Applied Mathematics, Korea Advanced Institute of Science and Technology. |

[11] | Kim, In vitro transcriptional activation of p21 promoter by p53, Biochemical and Biophysical Research Communications 234 pp 300– (1997) |

[12] | Lauritzen (1996) |

[13] | Mazal, Expression of aquaporins and PAX-2 compared to CD10 and cytokeratin 7 in renal neoplasms: a tissue microarray study, Modern Pathology 18 pp 535– (2005) |

[14] | Olesen, Maximal prime subgraph decomposition of Bayesian networks, IEEE Transactions on Systems, Man, and Cybernetics, Part B 32 pp 21– (2002) |

[15] | Osipov, Expression of p27 and VHL in renal tumors, Applied Immunohistochemistry & Molecular Morphology 10 pp 344– (2002) |

[16] | Ravikumar, High-dimensional graphical model selection using ℓ1-regularized logistic regression, Annals of Statistics (2009) |

[17] | Roe, p53 stabilization and transactivation by a von Hippel-Lindau protein, Molecular Cell 22 pp 395– (2006) |

[18] | Strobl, Bias in random forest variable importance measures: illustrations, sources and a solution, BMC Bioinformatics 8 pp 25– (2007) |

[19] | Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society B 58 pp 267– (1996) · Zbl 0850.62538 |

[20] | van Buuren, S. and Oudshoorn, C. (2007). Mice: multivariate imputation by chained equations. R package version 1.16. http://web.inter.nl.net/users/S.van.Buuren/mi/html/mice.htm. |

[21] | Wainwright, Advances in Neural Information Processing Systems 19 pp 1465– (2007) |

[22] | Wenger, R., Stiehl, D. and Camenisch, G. (2005). Integration of oxygen signaling at the consensus HRE. Science Signaling: Signal Transduction Knowledge Environment (STKE) 2005, re12. |

[23] | Yuan, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society B 68 pp 49– (2006) · Zbl 1141.62030 |

This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.