Dirichlet process mixture models for modeling and generating synthetic versions of nested categorical data. (English) Zbl 06873723

Summary: We present a Bayesian model for estimating the joint distribution of multivariate categorical data when units are nested within groups. Such data arise frequently in social science settings, for example, people living in households. The model assumes that (i) each group is a member of a group-level latent class, and (ii) each unit is a member of a unit-level latent class nested within its group-level latent class. This structure allows the model to capture dependence among units in the same group. It also facilitates simultaneous modeling of variables at both group and unit levels. We develop a version of the model that assigns zero probability to groups and units with physically impossible combinations of variables. We apply the model to estimate multivariate relationships in a subset of the American Community Survey. Using the estimated model, we generate synthetic household data that could be disseminated as redacted public use files. Supplementary materials for this article are available online.
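The nested latent class structure in the summary can be illustrated with a small generative sketch. It is not the authors' implementation: the truncation levels `F` and `S`, the parameter names `pi`, `omega`, `lam`, and `phi`, the flat Dirichlet draws, the fixed household size, and the "exactly one household head" rule are all assumptions made for illustration. Each household draws a group-level class, each person draws a unit-level class given that group class, and households that violate a feasibility rule are rejected, which is one simple way to sample from a model that assigns zero probability to impossible combinations.

```python
import random


def stick_breaking(rng, n_classes, alpha):
    """Truncated stick-breaking weights; the last class takes the leftover stick."""
    weights, remaining = [], 1.0
    for _ in range(n_classes - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights


def random_probs(rng, n_levels):
    """Flat Dirichlet draw, built from normalized Gamma(1, 1) variates."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_levels)]
    total = sum(draws)
    return [d / total for d in draws]


def draw(rng, probs):
    """Sample one category index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def make_params(rng, F=3, S=2, hh_levels=(2,), person_levels=(3, 2)):
    """Hypothetical parameters: F group classes, S unit classes per group class."""
    return {
        "pi": stick_breaking(rng, F, 1.0),                          # group-class weights
        "omega": [stick_breaking(rng, S, 1.0) for _ in range(F)],   # unit-class weights per group class
        "lam": [[random_probs(rng, d) for d in hh_levels]
                for _ in range(F)],                                 # group-level variables
        "phi": [[[random_probs(rng, d) for d in person_levels]
                 for _ in range(S)] for _ in range(F)],             # unit-level variables
    }


def sample_household(rng, params, size,
                     is_possible=lambda hh, persons: True, max_tries=10000):
    """Draw one household; redraw until the feasibility rule is satisfied."""
    for _ in range(max_tries):
        g = draw(rng, params["pi"])                      # group-level latent class
        hh = [draw(rng, p) for p in params["lam"][g]]
        persons = []
        for _ in range(size):
            m = draw(rng, params["omega"][g])            # unit-level class nested in g
            persons.append([draw(rng, p) for p in params["phi"][g][m]])
        if is_possible(hh, persons):
            return hh, persons
    raise RuntimeError("no feasible household found within max_tries")


# Usage: person variable 0 is treated as a relationship code with 0 = head
# (an assumed coding); a household is feasible only with exactly one head.
rng = random.Random(0)
params = make_params(rng)
one_head = lambda hh, persons: sum(p[0] == 0 for p in persons) == 1
household, members = sample_household(rng, params, size=3, is_possible=one_head)
```

Because every person in a household shares the group-level class `g`, their unit-level distributions are tied together, which is how a structure of this kind induces within-household dependence. In this sketch the parameters are drawn at random; in an actual analysis they would come from posterior draws of the fitted model.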


62H30 Classification and discrimination; cluster analysis (statistical aspects)
62P25 Applications of statistics to social sciences



