
zbMATH — the first resource for mathematics

Confidentiality and differential privacy in the dissemination of frequency tables. (English) Zbl 1403.62229
Summary: For decades, national statistical agencies and other data custodians have been publishing frequency tables based on census, survey and administrative data. In order to protect the confidentiality of individuals represented in the data, tables based on original data are modified before release. Recently, in response to user demand for more flexible and responsive table publication services, frequency table publication schemes have been augmented with on-line table generating servers such as the US Census Bureau FactFinder and the Australian Bureau of Statistics (ABS) TableBuilder. These systems allow users to build their own custom tables, and make use of automated perturbation routines to protect confidentiality. Motivated by the growing popularity of table generating servers, in this paper we study confidentiality protection for perturbed frequency tables, including the trade-off with analytical utility, focusing on a version of the ABS TableBuilder as a concrete example of a data release mechanism, and examining its properties. Confidentiality protection is assessed in terms of the differential privacy standard, and this paper can be used as a practical introduction to differential privacy, to calculations related to its application, to the relationship between confidentiality protection and utility and to confidentiality in general.

MSC:
62P25 Applications of statistics to social sciences
62H17 Contingency tables
Software:
PrivateLR