zbMATH — the first resource for mathematics

Learning causal structure from mixed data with missing values using Gaussian copula models. (English) Zbl 1430.62099
Summary: We consider the problem of causal structure learning from data with missing values, assumed to be drawn from a Gaussian copula model. First, we extend the “Rank PC” algorithm, designed for Gaussian copula models with purely continuous data (so-called nonparanormal models), to incomplete data by applying rank correlation to pairwise complete observations and replacing the sample size with an effective sample size in the conditional independence tests to account for the information loss from missing values. When the data are missing completely at random (MCAR), we provide an error bound on the accuracy of “Rank PC” and show its high-dimensional consistency. However, when the data are missing at random (MAR), “Rank PC” fails dramatically. Therefore, we propose a Gibbs sampling procedure to draw correlation matrix samples from mixed data that still works correctly under MAR. These samples are translated into an average correlation matrix and an effective sample size, resulting in the “Copula PC” algorithm for incomplete data. A simulation study shows that (1) “Copula PC” estimates a more accurate correlation matrix and causal structure than “Rank PC” under MCAR and, even more so, under MAR, and (2) using the effective sample size significantly improves the performance of “Rank PC” and “Copula PC”. We illustrate our methods on two real-world datasets: riboflavin production data and chronic fatigue syndrome data.
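
The first step described in the summary (rank correlation on pairwise complete observations, then a conditional independence test that uses an effective sample size) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names are hypothetical, Kendall's tau is mapped to a correlation by the standard Gaussian copula relation \(\rho = \sin(\pi\tau/2)\), and the effective sample size is simply assumed to be the mean number of pairwise complete observations, which may differ from the definition used in the paper.

```python
import math
import numpy as np


def kendall_tau(x, y):
    """Kendall's tau-a for two equal-length 1-D arrays (O(n^2), no tie correction)."""
    n = len(x)
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    # Each unordered pair is counted twice, hence the n * (n - 1) denominator.
    return float((dx * dy).sum() / (n * (n - 1)))


def pairwise_rank_correlation(data):
    """Correlation estimate and pairwise-complete counts from data containing NaNs."""
    p = data.shape[1]
    corr = np.eye(p)
    counts = np.zeros((p, p))
    for i in range(p):
        for j in range(i, p):
            ok = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            counts[i, j] = counts[j, i] = ok.sum()
            if i != j:
                tau = kendall_tau(data[ok, i], data[ok, j])
                # Gaussian copula relation between Kendall's tau and correlation.
                corr[i, j] = corr[j, i] = math.sin(math.pi / 2 * tau)
    return corr, counts


def ci_test_pvalue(corr, i, j, cond, n_eff):
    """Fisher z-test p-value for X_i independent of X_j given X_cond."""
    idx = [i, j] + list(cond)
    # A pairwise-complete estimate need not be positive definite; pinv is a
    # pragmatic choice for this sketch, not a remedy for that problem.
    prec = np.linalg.pinv(corr[np.ix_(idx, idx)])
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = min(max(r, -0.999999), 0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n_eff - len(cond) - 3) * abs(z)
    return math.erfc(stat / math.sqrt(2))  # two-sided standard normal p-value


# Toy chain X0 -> X1 -> X2 with monotone marginal transforms (nonparanormal data).
rng = np.random.default_rng(0)
n = 400
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
data = np.column_stack([x0, np.exp(x1), x2 ** 3])
data[rng.random(data.shape) < 0.2] = np.nan  # MCAR: each entry dropped independently

corr, counts = pairwise_rank_correlation(data)
# ASSUMPTION: effective sample size = mean pairwise-complete count.
n_eff = counts[np.triu_indices(3, k=1)].mean()
p_marginal = ci_test_pvalue(corr, 0, 2, [], n_eff)      # X0 vs X2, no conditioning
p_conditional = ci_test_pvalue(corr, 0, 2, [1], n_eff)  # X0 vs X2 given X1
```

In this toy chain the marginal test should reject independence of X0 and X2, while the test conditional on X1 typically does not. A PC-style search would apply such tests repeatedly to prune edges; the sketch covers only the MCAR setting and none of the Gibbs sampling used by “Copula PC”.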

MSC:
62H05 Characterization and structure theory for multivariate probability distributions; copulas
62D10 Missing data
68T05 Learning and adaptive systems in artificial intelligence
Software:
bfa; pcalg; Polycor; sbgcop; TETRAD
References:
[1] Anderson, T.W.: An Introduction to Multivariate Statistical Analysis. Wiley, New York (2003) · Zbl 1039.62044
[2] Baraldi, AN; Enders, CK, An introduction to modern missing data analyses, J. Sch. Psychol., 48, 5-37, (2010)
[3] Beinlich, I.A., Suermondt, H.J., Chavez, R.M., Cooper, G.F.: The ALARM monitoring system: a case study with two probabilistic inference techniques for belief networks. In: European Conference on Artificial Intelligence in Medicine, pp. 247-256. Springer, Berlin (1989)
[4] Budhathoki, K., Vreeken, J.: Causal inference by compression. In: International Conference on Data Mining, pp. 41-50. IEEE (2016)
[5] Bühlmann, P.; Kalisch, M.; Meier, L., High-dimensional statistics with a view toward applications in biology, Annu. Rev. Stat. Appl., 1, 255-278, (2014)
[6] Chen, Z., Zhang, K., Chan, L.: Nonlinear causal discovery for high dimensional data: a kernelized trace method. In: International Conference on Data Mining, pp. 1003-1008. IEEE (2013)
[7] Chickering, DM, Learning equivalence classes of Bayesian-network structures, J. Mach. Learn. Res., 2, 445-498, (2002) · Zbl 1007.68179
[8] Chickering, DM, Optimal structure identification with greedy search, J. Mach. Learn. Res., 3, 507-554, (2002) · Zbl 1084.68519
[9] Claassen, T., Mooij, J., Heskes, T.: Learning sparse causal models is not NP-hard. In: Conference on Uncertainty in Artificial Intelligence, pp. 172-181 (2013)
[10] Colombo, D., Maathuis, M.H., Kalisch, M., Richardson, T.S.: Learning high-dimensional directed acyclic graphs with latent and selection variables. Ann. Stat. 40(1), 294-321 (2012) · Zbl 1246.62131
[11] Cui, R., Groot, P., Heskes, T.: Copula PC algorithm for causal discovery from mixed data. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 377-392. Springer, Berlin (2016)
[12] Cui, R., Groot, P., Heskes, T.: Robust estimation of Gaussian copula causal structure from mixed data with missing values. In: IEEE International Conference on Data Mining, pp. 835-840. IEEE (2017)
[13] Dezeure, R.; Bühlmann, P.; Meier, L.; Meinshausen, N.; et al., High-dimensional inference: confidence intervals, \(p\)-values and R-software hdi, Stat. Sci., 30, 533-558, (2015) · Zbl 1426.62183
[14] Didelez, V.; Pigeot, I., Maximum likelihood estimation in graphical models with missing values, Biometrika, 85, 960-966, (1998) · Zbl 1101.62315
[15] Dobra, A.; Lenkoski, A.; et al., Copula Gaussian graphical models and their application to modeling functional disability data, Ann. Appl. Stat., 5, 969-993, (2011) · Zbl 1232.62046
[16] Fan, J.; Liu, H.; Ning, Y.; Zou, H., High dimensional semiparametric latent graphical model for mixed data, J. R. Stat. Soc. Ser. B. Stat. Methodol., 79, 405-421, (2017) · Zbl 1414.62179
[17] Fox, J.: Polycor: polychoric and polyserial correlations. R package version 0.7-5. http://CRAN.R-project.org/package=polycor (2007)
[18] Gruhl, J.; Erosheva, EA; Crane, PK; et al., A semiparametric approach to mixed outcome latent variable models: estimating the association between cognition and regional brain volumes, Ann. Appl. Stat., 7, 2361-2383, (2013) · Zbl 1283.62218
[19] Harris, N.; Drton, M., PC algorithm for nonparanormal graphical models, J. Mach. Learn. Res., 14, 3365-3383, (2013) · Zbl 1318.62197
[20] Heins, MJ; Knoop, H.; Burk, WJ; Bleijenberg, G., The process of cognitive behaviour therapy for chronic fatigue syndrome: which changes in perpetuating cognitions and behaviour are related to a reduction in fatigue?, J. Psychosom. Res., 75, 235-241, (2013)
[21] Herdin, M., Czink, N., Ozcelik, H., Bonek, E.: Correlation matrix distance, a meaningful measure for evaluation of non-stationary MIMO channels. In: Vehicular Technology Conference, 2005. VTC 2005-Spring. 2005 IEEE 61st, vol. 1, pp. 136-140. IEEE (2005)
[22] Hoeffding, W., Probability inequalities for sums of bounded random variables, J. Am. Stat. Assoc., 58, 13-30, (1963) · Zbl 0127.10602
[23] Hoff, P.D.: Extending the rank likelihood for semiparametric copula estimation. Ann. Appl. Stat. 1(1), 265-283 (2007) · Zbl 1129.62050
[24] Hoff, P.D.: sbgcop: semiparametric Bayesian Gaussian copula estimation and imputation. R package version 0.975 (2010)
[25] Hoff, PD; Niu, X.; Wellner, JA, Information bounds for Gaussian copulas, Bernoulli, 20, 604, (2014) · Zbl 1321.62054
[26] Kalaitzis, A., Silva, R.: Flexible sampling of discrete data correlations without the marginal distributions. In: Advances in Neural Information Processing Systems, pp. 2517-2525 (2013)
[27] Kalisch, M.; Bühlmann, P., Estimating high-dimensional directed acyclic graphs with the PC-algorithm, J. Mach. Learn. Res., 8, 613-636, (2007) · Zbl 1222.68229
[28] Kalisch, M., Mächler, M., Colombo, D.: pcalg: estimation of CPDAG/PAG and causal inference using the IDA algorithm. http://CRAN.R-project.org/package=pcalg (2010)
[29] Kendall, M.G.: Rank Correlation Methods. Griffin, London (1948) · Zbl 0032.17602
[30] Kolar, M., Xing, E.P.: Estimating sparse precision matrices from data with missing values. In: International Conference on Machine Learning (2012)
[31] Kruskal, WH, Ordinal measures of association, J. Am. Stat. Assoc., 53, 814-861, (1958) · Zbl 0087.15403
[32] Lauritzen, S.L., Spiegelhalter, D.J.: Local computations with probabilities on graphical structures and their application to expert systems. J. R. Stat. Soc. Ser. B. Stat. Methodol. 50(2), 157-224 (1988) · Zbl 0684.68106
[33] Little, R.J., Rubin, D.B.: Statistical Analysis with Missing Data. Wiley, New York (1987) · Zbl 0665.62004
[34] Liu, H.; Han, F.; Yuan, M.; Lafferty, J.; Wasserman, L.; et al., High-dimensional semiparametric Gaussian copula graphical models, Ann. Stat., 40, 2293-2326, (2012) · Zbl 1297.62073
[35] Lounici, K., High-dimensional covariance matrix estimation with missing observations, Bernoulli, 20, 1029-1058, (2014) · Zbl 1320.62124
[36] Magliacane, S., Claassen, T., Mooij, J.M.: Ancestral causal inference. In: Advances in Neural Information Processing Systems, pp. 4466-4474 (2016)
[37] Middleton, S.; McElduff, P.; Ward, J.; Grimshaw, JM; Dale, S.; D’Este, C.; Drury, P.; Griffiths, R.; Cheung, NW; Quinn, C.; et al., Implementation of evidence-based treatment protocols to manage fever, hyperglycaemia, and swallowing dysfunction in acute stroke (QASC): a cluster randomised controlled trial, Lancet, 378, 1699-1706, (2011)
[38] Murray, JS; Dunson, DB; Carin, L.; Lucas, JE, Bayesian Gaussian copula factor models for mixed data, J. Am. Stat. Assoc., 108, 656-665, (2013) · Zbl 06195968
[39] Muthén, B., A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators, Psychometrika, 49, 115-132, (1984)
[40] Nelsen, R.B.: An Introduction to Copulas. Springer, Berlin (2007) · Zbl 1152.62030
[41] Pearl, J.: Causality. Cambridge University Press, Cambridge (2009) · Zbl 1188.68291
[42] Pearl, J.; Verma, TS, A statistical semantics for causation, Stat. Comput., 2, 91-95, (1992)
[43] Peters, J.; Mooij, JM; Janzing, D.; Schölkopf, B.; et al., Causal discovery with continuous additive noise models, J. Mach. Learn. Res., 15, 2009-2053, (2014) · Zbl 1318.68151
[44] Poleto, FZ; Singer, JM; Paulino, CD, Missing data mechanisms and their implications on the analysis of categorical data, Stat. Comput., 21, 31-43, (2011) · Zbl 1274.62652
[45] Rahmadi, R.; Groot, P.; Heins, M.; Knoop, H.; Heskes, T.; et al., Causality on cross-sectional data: stable specification search in constrained structural equation modeling, Appl. Soft. Comput., 52, 687-698, (2017)
[46] Ramsey, J., Zhang, J., Spirtes, P.L.: Adjacency-Faithfulness and Conservative Causal Inference. arXiv preprint arXiv:1206.6843 (2012)
[47] Rubin, DB, Inference and missing data, Biometrika, 63, 581-592, (1976) · Zbl 0344.62034
[48] Schafer, JL; Graham, JW, Missing data: our view of the state of the art, Psychol. Methods, 7, 147, (2002)
[49] Spirtes, P., Glymour, C.N., Scheines, R.: Causation, Prediction, and Search. MIT Press, Cambridge (2000) · Zbl 0806.62001
[50] Städler, N.; Bühlmann, P., Missing values: sparse inverse covariance estimation and an extension to sparse regression, Stat. Comput., 22, 219-235, (2012) · Zbl 1322.62115
[51] Strobl, E.V., Visweswaran, S., Spirtes, P.L.: Fast Causal Inference with Non-random Missingness by Test-Wise Deletion. arXiv preprint arXiv:1705.09031 (2017)
[52] Triantafillou, S.; Tsamardinos, I., Constraint-based causal discovery from multiple interventions over overlapping variable sets, J. Mach. Learn. Res., 16, 2147-2205, (2015) · Zbl 1351.68239
[53] Tsamardinos, I.; Brown, LE; Aliferis, CF, The max-min hill-climbing Bayesian network structure learning algorithm, Mach. Learn., 65, 31-78, (2006)
[54] Wang, H., Fazayeli, F., Chatterjee, S., Banerjee, A., Steinhauser, K., Ganguly, A., Bhattacharjee, K., Konar, A., Nagar, A.: Gaussian copula precision estimation with missing values. In: International Conference on Artificial Intelligence and Statistics, pp. 978-986 (2014)
[55] Wang, J., Loong, B., Westveld, A.H., Welsh, A.H.: A Copula-Based Imputation Model for Missing Data of Mixed Type in Multilevel Data Sets. arXiv preprint arXiv:1702.08148 (2017)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.