Bayesian nonparametric disclosure risk estimation via mixed effects log-linear models. (English) Zbl 1454.62107

Summary: Statistical agencies and other institutions collect data under the promise to protect the confidentiality of respondents. When releasing microdata samples, the risk that records can be identified must be assessed. To this end, a widely adopted approach is to isolate the categorical variables key to identification and analyze multi-way contingency tables of such variables. Common disclosure risk measures focus on sample unique cells in these tables and adopt parametric log-linear models as the standard statistical tools for the problem. Such models often have to deal with large and extremely sparse tables that pose a number of challenges to risk estimation. This paper proposes to overcome these problems by studying nonparametric alternatives based on Dirichlet process random effects. The main finding is that including such random effects considerably reduces the number of fixed effects required to achieve reliable risk estimates. This is studied in applications to real data, which suggest, in particular, that our mixed models with main effects only produce estimates roughly equivalent to those of the all-two-way-interactions models, and are effective in defusing potential shortcomings of traditional log-linear models. This paper adopts a fully Bayesian approach that accounts for all sources of uncertainty, including that about the population frequencies, and supplies unconditional (posterior) variances and credible intervals.
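To make the notion of "sample unique cells" concrete, the following is a minimal Python sketch, not the paper's method. It cross-classifies records on a set of key variables and computes two risk measures that are common in this literature: the number of sample uniques that are also population unique, and the sum of 1/F_k over sample unique cells, where F_k is the population frequency of cell k. The function names, the key variables, and the toy data are all hypothetical. The sketch also assumes the population frequencies are known and that the sample is drawn from the population, which is exactly what fails in practice and why model-based estimates are needed.

```python
from collections import Counter


def cell_counts(records, key_vars):
    """Cross-classify records on the key variables: cell -> frequency."""
    return Counter(tuple(r[v] for v in key_vars) for r in records)


def risk_measures(sample, population, key_vars):
    """Return (number of sample uniques, tau1, tau2).

    tau1 counts sample unique cells that are also population unique.
    tau2 sums 1/F_k over sample unique cells.
    Assumes every sample record also appears in the population,
    so that F_k >= 1 for every cell observed in the sample.
    """
    f = cell_counts(sample, key_vars)      # sample frequencies f_k
    F = cell_counts(population, key_vars)  # population frequencies F_k
    uniques = [cell for cell, n in f.items() if n == 1]
    tau1 = sum(1 for cell in uniques if F[cell] == 1)
    tau2 = sum(1.0 / F[cell] for cell in uniques)
    return len(uniques), tau1, tau2


# Toy population of eight records (hypothetical data).
population = [
    {"sex": "F", "age": "30-39", "region": "N"},
    {"sex": "F", "age": "30-39", "region": "N"},
    {"sex": "M", "age": "30-39", "region": "N"},
    {"sex": "M", "age": "40-49", "region": "S"},
    {"sex": "M", "age": "40-49", "region": "S"},
    {"sex": "M", "age": "40-49", "region": "S"},
    {"sex": "M", "age": "40-49", "region": "S"},
    {"sex": "F", "age": "50-59", "region": "S"},
]
# A sample of four records drawn from that population.
sample = [population[0], population[2], population[3], population[4]]

n_unique, tau1, tau2 = risk_measures(sample, population, ["sex", "age", "region"])
print(n_unique, tau1, tau2)
```

In a real release only the sample table is observed, so F_k must be predicted from a model fitted to the sample counts. With many key variables the table becomes large and sparse, which is the setting the summary describes.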


62G05 Nonparametric estimation
62F15 Bayesian inference
62H17 Contingency tables

