Learning causal structure from mixed data with missing values using Gaussian copula models. (English) Zbl 1430.62099
Summary: We consider the problem of causal structure learning from data with missing values, assumed to be drawn from a Gaussian copula model. First, we extend the “Rank PC” algorithm, designed for Gaussian copula models with purely continuous data (so-called nonparanormal models), to incomplete data by applying rank correlation to pairwise complete observations and replacing the sample size with an effective sample size in the conditional independence tests to account for the information loss from missing values. When the data are missing completely at random (MCAR), we provide an error bound on the accuracy of “Rank PC” and show its high-dimensional consistency. However, when the data are missing at random (MAR), “Rank PC” fails dramatically. Therefore, we propose a Gibbs sampling procedure to draw correlation matrix samples from mixed data that still works correctly under MAR. These samples are translated into an average correlation matrix and an effective sample size, resulting in the “Copula PC” algorithm for incomplete data. A simulation study shows that (1) “Copula PC” estimates a more accurate correlation matrix and causal structure than “Rank PC” under MCAR and, even more so, under MAR, and (2) the use of the effective sample size significantly improves the performance of both “Rank PC” and “Copula PC”. We illustrate our methods on two real-world datasets: riboflavin production data and chronic fatigue syndrome data.
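
The first step of the summary (rank correlation on pairwise complete observations, followed by conditional independence tests that use an effective sample size) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names are hypothetical, the effective sample size is assumed here to be the mean number of pairwise complete rows (the summary does not state the paper's definition), and the test is a standard Fisher z-test on partial correlations. The mapping 2 sin(pi * rho / 6) from Spearman's rho to the latent correlation is the usual one for nonparanormal models.

```python
import numpy as np
from scipy import stats


def pairwise_rank_correlation(X):
    """Latent correlation estimate from Spearman's rho on pairwise complete rows."""
    p = X.shape[1]
    observed = ~np.isnan(X)
    R = np.eye(p)
    counts = np.zeros((p, p), dtype=int)
    for j in range(p):
        counts[j, j] = observed[:, j].sum()
        for k in range(j + 1, p):
            both = observed[:, j] & observed[:, k]
            counts[j, k] = counts[k, j] = both.sum()
            # Spearman's rho = Pearson correlation of the ranks
            ranks_j = stats.rankdata(X[both, j])
            ranks_k = stats.rankdata(X[both, k])
            rho = np.corrcoef(ranks_j, ranks_k)[0, 1]
            # Map rank correlation to the latent Gaussian correlation scale
            R[j, k] = R[k, j] = 2.0 * np.sin(np.pi * rho / 6.0)
    return R, counts


def partial_correlation(R, i, j, cond):
    """Partial correlation of variables i and j given the variables in cond."""
    idx = [i, j] + list(cond)
    precision = np.linalg.pinv(R[np.ix_(idx, idx)])
    return -precision[0, 1] / np.sqrt(precision[0, 0] * precision[1, 1])


def ci_test_pvalue(R, n_eff, i, j, cond=()):
    """Fisher z-test p-value, with n_eff in place of the nominal sample size."""
    r = np.clip(partial_correlation(R, i, j, cond), -0.999999, 0.999999)
    z = np.sqrt(n_eff - len(cond) - 3) * np.arctanh(r)
    return 2.0 * stats.norm.sf(abs(z))


# Synthetic chain X0 -> X1 -> X2 with monotone marginal transforms (a Gaussian copula)
rng = np.random.default_rng(0)
n = 2000
z0 = rng.standard_normal(n)
z1 = 0.8 * z0 + 0.6 * rng.standard_normal(n)
z2 = 0.8 * z1 + 0.6 * rng.standard_normal(n)
X = np.column_stack([np.exp(z0), z1, z2 ** 3])

# MCAR: each entry is deleted independently with probability 0.2
X[rng.random(X.shape) < 0.2] = np.nan

R, counts = pairwise_rank_correlation(X)
# Assumed definition of the effective sample size: mean pairwise complete count
n_eff = counts[np.triu_indices_from(counts, k=1)].mean()

p_marginal = ci_test_pvalue(R, n_eff, 0, 2)           # X0 and X2, no conditioning
p_conditional = ci_test_pvalue(R, n_eff, 0, 2, [1])   # X0 and X2 given X1
print(n_eff, p_marginal, p_conditional)
```

In this toy chain the marginal test between X0 and X2 should reject independence, while the test conditional on X1 should yield a partial correlation near zero. A PC-style search would use such tests to remove the X0, X2 edge. The MAR case and mixed data, for which the summary proposes Gibbs sampling, are not covered by this sketch.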

62H05 Characterization and structure theory for multivariate probability distributions; copulas
62D10 Missing data
68T05 Learning and adaptive systems in artificial intelligence
bfa; pcalg; Polycor; sbgcop; TETRAD
