
A firm foundation for statistical disclosure control. (English) Zbl 1466.62440

Summary: The present article reviews the theory of data privacy and confidentiality in statistics and computer science in order to modernize the theory of anonymization. This effort results in mathematical definitions of identity disclosure and attribute disclosure that are applicable even to synthetic data. Differential privacy is also clarified as a method to bound the accuracy of population inference. This bound is derived from the Hammersley-Chapman-Robbins inequality, and it leads to an intuitive selection of the privacy budget \(\epsilon\) of differential privacy.
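
As a rough illustration of how such a bound can arise (a sketch under stated assumptions, not necessarily the derivation used in the article): suppose a randomized mechanism satisfies \(\epsilon\)-differential privacy, so that its output densities \(p_D\) and \(p_{D'}\) for any two neighboring databases \(D\) and \(D'\) satisfy \(e^{-\epsilon} \le p_{D'}(y)/p_{D}(y) \le e^{\epsilon}\). The symbols \(D\), \(D'\), \(p_D\), \(p_{D'}\) and \(T\) are notation introduced here for illustration only. For any statistic \(T\) of the output with finite variance, and whenever \(p_{D'} \ne p_{D}\), the Hammersley-Chapman-Robbins inequality combined with the ratio bound gives

\[
\operatorname{Var}_{D}(T) \;\ge\; \frac{\bigl(\mathrm{E}_{D'}[T]-\mathrm{E}_{D}[T]\bigr)^{2}}{\chi^{2}(p_{D'}\,\|\,p_{D})}
\;\ge\; \frac{\bigl(\mathrm{E}_{D'}[T]-\mathrm{E}_{D}[T]\bigr)^{2}}{(e^{\epsilon}-1)^{2}},
\qquad
\chi^{2}(p_{D'}\,\|\,p_{D}) = \mathrm{E}_{D}\!\left[\Bigl(\frac{p_{D'}}{p_{D}}-1\Bigr)^{2}\right].
\]

The second inequality holds because \(|p_{D'}/p_{D}-1| \le e^{\epsilon}-1\) under the ratio bound. Under these assumptions, a smaller \(\epsilon\) forces a larger variance for any statistic whose expectation distinguishes \(D\) from \(D'\); for small \(\epsilon\) the standard deviation is at least roughly \(|\mathrm{E}_{D'}[T]-\mathrm{E}_{D}[T]|/\epsilon\), which is one way to read the privacy budget as a limit on inferential accuracy.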

MSC:

62P25 Applications of statistics to social sciences
62A01 Foundations and philosophical topics in statistics

References:

[1] Abowd, JM; Vilhuber, L.; Domingo-Ferrer; Saygin, How protective are synthetic data?, Privacy in statistical databases. Lecture notes in computer science, 239-246 (2008), New York: Springer, New York
[2] Aggarwal, CC; Yu, PS; Bertino, E., A condensation approach to privacy preserving data mining, Advances in database technology—EDBT, lecture notes in computer science, 183-199 (2004), Berlin: Springer, Berlin
[3] Aggarwal, CC; Yu, PS, Privacy-preserving data mining: models and algorithms (2008), New York: Springer, New York
[4] Agrawal, R., & Srikant, R. (2000). Privacy preserving data mining. In Proceedings of ACM International Conference on Management of Data (SIGMOD) (pp. 439-450).
[5] Anderson, MJ; Seltzer, W., Federal statistical confidentiality and business data: Twentieth century challenges and continuing issues, Journal of Privacy and Confidentiality, 1, 7-52 (2009)
[6] Baayen, RH, Word frequency distributions (2001), Dordrecht: Kluwer, Dordrecht · Zbl 0989.68146
[7] Bambauer, J.; Muralidhar, K.; Sarathy, R., Fool’s gold: an illustrated critique of differential privacy, Vanderbilt Journal of Entertainment and Technology Law, 16, 701-755 (2013)
[8] Barbaro, M., & Zeller, T. (2006). A face is exposed for AOL searcher no. 4417749, The New York Times.
[9] Beckman, RJ; Baggerly, KA; McKay, MD, Creating synthetic baseline populations, Transportation Research, Part A: Policy and Practice, 30, 415-429 (1996)
[10] Benedetto, G., Stanley, J.C., & Totty, E. (2018) The creation and use of the SIPP synthetic Beta v7.0, U.S. Census Bureau.
[11] Bethlehem, JG; Keller, WJ; Pannekoek, J., Disclosure control of microdata, Journal of the American Statistical Association, 85, 38-45 (1990)
[12] Birnbaum, A., On the foundation of statistical inference, Journal of the American Statistical Association, 57, 269-306 (1962) · Zbl 0107.36505
[13] Bishop, YMM; Fienberg, SE; Holland, PW, Discrete multivariate analysis: Theory and practice (1975), Cambridge: MIT Press, Cambridge · Zbl 0332.62039
[14] Bowen, CM; Liu, F., Comparative study of differentially private data synthesis methods, Statistical Science, 35, 280-307 (2020) · Zbl 07292514
[15] Brand, R. (2002). Microdata protection through noise addition. In Domingo-Ferrer (Ed.), Inference control in statistical databases: From theory to practice, lecture notes in computer science (Vol. 2316, pp. 97-116). Berlin: Springer. · Zbl 0992.68514
[16] Brandt, M.; Lenz, R.; Rosemann, M.; Domingo-Ferrer, Anonymisation of panel enterprise microdata—Survey of a German project, Privacy in statistical databases, lecture notes in computer science, 139-151 (2008), Berlin: Springer, Berlin
[17] Butz, W.; Torrey, B., Some frontiers in social science, Science, 312, 1898-1900 (2006)
[18] Chapman, DG; Robbins, H., Minimum variance estimation without regularity assumptions, The Annals of Mathematical Statistics, 22, 581-586 (1951) · Zbl 0044.34302
[19] Chaudhuri, K., & Mishra, N. (2006). When random sampling preserves privacy. In Proceedings of the 26th Annual International Conference on Advances in Cryptology (CRYPTO 2006) (pp. 198-213). Berlin: Springer. · Zbl 1161.94438
[20] Clifton, C.; Tassa, T., On syntactic anonymity and differential privacy, Transactions on Data Privacy, 6, 161-183 (2013)
[21] Dalenius, T., Finding a needle in a haystack – or identifying anonymous census records, Journal of Official Statistics, 2, 329-336 (1986)
[22] Dankar, FK; El Emam, K., Practicing differential privacy in health care: A review, Transactions on Data Privacy, 5, 35-67 (2013)
[23] Deming, WE; Stephan, FF, On a least squares adjustment of a sampled frequency table when the expected marginal totals are known, The Annals of Mathematical Statistics, 11, 427-444 (1940) · Zbl 0024.05502
[24] Deng, M.; Wuyts, K.; Scandariato, R.; Preneel, B.; Joosen, W., A privacy threat analysis framework: Supporting the elicitation and fulfillment of privacy requirements, Requirements Engineering, 16, 3-32 (2011)
[25] Dennis, JC, Privacy and confidentiality of health information (2000), San Francisco: Jossey-Bass, San Francisco
[26] Dinur, I., & Nissim, K. (2003). Revealing information while preserving privacy. In Proceedings of the Twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (pp. 202-210).
[27] Domingo-Ferrer, J.; Torra, V., Privacy in statistical databases, lecture notes in computer science (2004), Berlin: Springer, Berlin
[28] D’Orazio, M.; Di Zio, M.; Scanu, M., Statistical matching: Theory and practice (2006), Chichester: Wiley, Chichester · Zbl 1107.62008
[29] Doyle, P.; Lane, JI; Theeuwes, JJM; Zayatz, LV, Confidentiality, disclosure, and data access (2001), Amsterdam: Elsevier, Amsterdam
[30] Drechsler, J., Synthetic datasets for statistical disclosure control: Theory and implementation, lecture notes in statistics (2011), New York: Springer, New York · Zbl 1279.62015
[31] Duncan, GT; Elliot, M.; Salazar-González, JJ, Statistical confidentiality (2011), New York: Springer, New York · Zbl 1233.62204
[32] Dwork, C. (2006). Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming-ICALP 2006, Part II, Lecture Notes in Computer Science (Vol. 4052, pp. 1-12). Springer. · Zbl 1133.68330
[33] Dwork, C., A firm foundation for private data analysis, Communications of the ACM, 54, 86-95 (2011)
[34] Dwork, C.; Kenthapadi, K.; McSherry, F.; Mironov, I.; Naor, M.; Vaudenay, S., Our data, ourselves: privacy via distributed noise generation, Advances in cryptology - EUROCRYPT 2006, 486-503 (2006), Berlin: Springer, Berlin · Zbl 1140.94336
[35] Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006b). Calibrating noise to sensitivity in private data analysis. In TCC 2006-theory of cryptography conference (pp. 265-284). · Zbl 1112.94027
[36] Dwork, C.; Smith, A.; Steinke, T.; Ullman, J., Exposed! A survey of attacks on private data, Annual Review of Statistics and Its Application, 4, 61-84 (2017)
[37] Efron, B., Bootstrap methods: Another look at the jackknife, Annals of Statistics, 7, 1-26 (1979) · Zbl 0406.62024
[38] El Emam, K.; Arbuckle, L., Anonymizing health data (2013), Sebastopol: O’Reilly, Sebastopol
[39] Erlingsson, U., Pihur, V., & Korolova, A. (2014). RAPPOR: randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 21st ACM Conference on Computer and Communications Security, ACM, Scottsdale, Arizona.
[40] Evett, I.; Jackson, G.; Lambert, JA; McCrossan, S., The impact of the principles of evidence interpretation on the structure and content of statements, Science & Justice, 40, 233-239 (2000)
[41] Fienberg, S. E. (1994). A radical proposal for the provision of micro-data samples and the preservation of confidentiality. Technical report, Department of Statistics, Carnegie Mellon University.
[42] Fienberg, SE; Kempf-Leonard, Confidentiality and disclosure limitation, Encyclopedia of social measurement, 463-469 (2005), New York: Elsevier, New York
[43] Fienberg, SE; Holland, PW, Simultaneous estimation of multinomial cell probabilities, Journal of the American Statistical Association, 68, 683-691 (1973) · Zbl 0267.62030
[44] Fienberg, SE; Makov, UE; Steele, RJ, Disclosure limitation using perturbation and related methods for categorical data, Journal of Official Statistics, 14, 485-502 (1998)
[45] Fung, BCM; Wang, K.; Fu, AWC; Yu, PS, Introduction to privacy-preserving data publishing (2010), Boca Raton: Chapman and Hall/CRC, Boca Raton
[46] Ghosh, A.; Roughgarden, T.; Sundararajan, M., Universally utility-maximizing privacy mechanism, SIAM Journal on Computing, 41, 1673-1693 (2012) · Zbl 1271.68102
[47] Giessing, S.; Domingo-Ferrer; Torra, Survey on methods for tabular data protection in ARGUS, Privacy in statistical databases, lecture notes in computer science, 1-13 (2004), Berlin: Springer, Berlin
[48] Godambe, VP, A unified theory of sampling from finite populations, Journal of the Royal Statistical Society, B, 17, 268-278 (1955) · Zbl 0067.11406
[49] Goel, V. (2014). How Facebook sold you krill oil, The New York Times.
[50] Good, IJ, The population frequencies of species and the estimation of population parameters, Biometrika, 40, 237-264 (1953) · Zbl 0051.37103
[51] Gottschalk, S., Microdata disclosure by resampling - Empirical findings for business survey data, Allgemeines Statistisches Archiv, 88, 279-302 (2004) · Zbl 1124.62319
[52] Hammersley, JM, On estimating restricted parameters, The Journal of the Royal Statistical Society, Series B, 12, 192-240 (1950) · Zbl 0040.22202
[53] Heard, D.; Dent, G.; Schifeling, T.; Banks, D., Agent-based models and microsimulation, Annual Review of Statistics and its Application, 2, 259-272 (2015)
[54] Horvitz, DG; Thompson, DJ, A generalization of sampling without replacement from a finite universe, Journal of the American Statistical Association, 47, 663-685 (1952) · Zbl 0047.38301
[55] Hoshino, N., The quasi-multinomial distribution as a tool for disclosure risk assessment, Journal of Official Statistics, 25, 269-291 (2009)
[56] Hoshino, N., Evidence based anonymization, Journal of the Japan Statistical Society, Series J, 46, 1-42 (2016) · Zbl 07387533
[57] Hoshino, N. (2018). The control of statistical inference. In Talk at computer security symposium 2018, October 24. (In Japanese.).
[58] Hundepool, A.; Domingo-Ferrer, J.; Franconi, L.; Giessing, S.; Nordholt, ES; Spicer, K.; de Wolf, PP, Statistical disclosure control (2012), West Sussex: Wiley, West Sussex
[59] Inusah, S.; Kozubowski, TJ, A discrete analogue of the Laplace distribution, Journal of Statistical Planning and Inference, 136, 1090-1102 (2006) · Zbl 1081.60011
[60] Jeffreys, H., Some tests of significance, treated by the theory of probability, Mathematical Proceedings of the Cambridge Philosophical Society, 31, 203-222 (1935) · Zbl 0011.31601
[61] Jeffreys, H., Theory of probability (1961), Oxford: Oxford University Press, Oxford · Zbl 0116.34904
[62] Kasiviswanathan, SP; Smith, A., On the semantics of differential privacy: A Bayesian formulation, Journal of Privacy and Confidentiality, 6, 1-16 (2014)
[63] Kass, RE; Raftery, AE, Bayes factors, Journal of the American Statistical Association, 90, 773-795 (1995) · Zbl 0846.62028
[64] Khmaladze, E. V. (1987). The statistical analysis of a large number of rare events. Technical Report MS-R8804, Department of Mathematical Statistics, CWI. Amsterdam: Center for Mathematics and Computer Science.
[65] Kifer, D., & Machanavajjhala, A. (2011). No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (SIGMOD ’11) (pp. 193-204). Association for Computing Machinery, New York, NY, USA.
[66] Kifer, D., & Machanavajjhala, A. (2014). Pufferfish: A framework for mathematical privacy definitions. ACM Transactions on Database Systems, 39, [a3]. doi:10.1145/2514689 · Zbl 1321.94067
[67] Kotz, S.; Kozubowski, T.; Podgórski, K., The Laplace distribution and generalizations: A revisit with applications to communications, economics, engineering, and finance (2001), Boston: Birkhäuser, Boston · Zbl 0977.62003
[68] Lee, J., & Clifton, C. (2011). How much is enough? Choosing \(\epsilon\) for differential privacy. In Lai et al. (Eds.) ISC 2011, Lecture Notes in Computer Science (Vol. 7001, pp. 325-340).
[69] Lehmann, EL; Casella, G., Theory of point estimation (1998), New York: Springer, New York · Zbl 0916.62017
[70] Li, N., Li, T., & Venkatasubramanian, S. (2007). \(t\)-Closeness: Privacy beyond \(k\)-anonymity and \(\ell \)-diversity. In IEEE 23rd International Conference on Data Engineering (ICDE) (pp. 106-115).
[71] Lindell, Y., & Pinkas, B. (2000). Privacy preserving data mining. In Mihir Bellare (Ed.) Proceedings of the 20th Annual International Cryptology Conference on Advances in Cryptology (CRYPTO ’00) (pp. 36-54). London: Springer. · Zbl 0989.68506
[72] Little, R., Statistical analysis of masked data, Journal of Official Statistics, 9, 407-426 (1993)
[73] Liu, C.; He, X.; Chanyaswad, T.; Wang, S.; Mittal, P., Investigating statistical privacy frameworks from the perspective of hypothesis testing, Proceedings on Privacy Enhancing Technologies, 2019, 3, 233-254 (2019)
[74] Lowrance, WW, Privacy, confidentiality, and health research (2012), New York: Cambridge University Press, New York
[75] Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., & Vilhuber, L. (2008). Privacy: Theory meets practice on the map. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE’08 (pp. 277-286).
[76] Machanavajjhala, A., Kifer, D., Gehrke, J., & Venkitasubramaniam, M. (2007). \( \ell \)-diversity: privacy beyond \(k\)-anonymity. ACM Transactions on Knowledge Discovery from Data, 1(1), Article 3.
[77] Marsh, C.; Skinner, C.; Arber, S.; Penhale, P.; Openshaw, S.; Hobcraft, J.; Lievesley, D.; Walford, N., The case for a sample of anonymized records from the 1991 census, Journal of the Royal Statistical Society, Series A, 154, 305-340 (1991)
[78] Meiser, S., Approximate and probabilistic differential privacy definitions, IACR Cryptology ePrint Archive, 2018, 277 (2018)
[79] Mendes, R.; Vilela, JP, Privacy-preserving data mining: Methods, metrics, and applications, IEEE Access, 5, 10562-10582 (2017)
[80] Muralidhar, K.; Sarathy, R.; Li, H., Secure attribute sharing of linked microdata, Decision Support Systems, 81, 20-29 (2016)
[81] Nakamura, H., Microdata access for official statistics in Japan, Sociological Theory and Methods, 32, 310-320 (2017)
[82] National Research Council, Putting people on the map: Protecting confidentiality with linked social-spatial data (2007), Washington: The National Academies Press, Washington
[83] Neyman, J.; Pearson, ES, On the problem of the most efficient tests of statistical hypotheses, Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231, 289-337 (1933) · JFM 59.1163.02
[84] Nin, J.; Herranz, J., Privacy and anonymity in information management systems (2010), London: Springer, London
[85] Nissim, K., Raskhodnikova, S., & Smith, A. (2007). Smooth sensitivity and sampling in private data analysis. In Proceedings of the Annual ACM Symposium on Theory of Computing (pp. 75-84). · Zbl 1232.68039
[86] O’Keefe, CM; Camenisch, J.; Fischer-Hubner, S.; Hansen, M., Privacy and confidentiality in service science and big data analytics, Privacy and identity management for the future internet in the age of globalisation, privacy and identity 2014. IFIP advances in information and communication technology, 54-70 (2015), Cham: Springer, Cham
[87] Pawitan, Y., In all likelihood (2001), Oxford: Clarendon Press, Oxford · Zbl 1013.62001
[88] Pfitzmann, A., & Hansen, M. (2010). A terminology for talking about privacy by data minimization: anonymity, unlinkability, undetectability, unobservability, pseudonymity, and identity management. Version 0.34 August 2010, Technical Report, TU Dresden and ULD Kiel. http://dud.inf.tu-dresden.de/Anon_Terminology.shtml
[89] President’s Council of Advisors on Science and Technology, Report to the president: Big data and privacy: A technological perspective (2014), Washington: Executive Office of the President, Washington
[90] Quatember, A., Pseudo-populations (2015), Cham: Springer, Cham · Zbl 1347.62009
[91] Raab, GM; Nowok, B.; Dibben, C., Practical data synthesis for large samples, Journal of Privacy and Confidentiality, 7, 67-97 (2017)
[92] Reiter, JP, Differential privacy and federal data releases, Annual Review of Statistics and its Application, 6, 85-101 (2019)
[93] Rinott, Y.; O’Keefe, CM; Shlomo, N.; Skinner, C., Confidentiality and differential privacy in the dissemination of frequency tables, Statistical Science, 33, 358-385 (2018) · Zbl 1403.62229
[94] Ritchie, F. (2017). The “Five Safes”: A framework for planning, designing and evaluating data access solutions. Paper presented at Data for Policy 2017, London, UK.
[95] Ritchie, F., Secure access to confidential microdata: Four years of the Virtual Microdata Laboratory, Economic and Labour Market Review, 2, 29-34 (2008)
[96] Rocher, L.; Hendrickx, JM; de Montjoye, Y., Estimating the success of re-identifications in incomplete datasets using generative models, Nature Communications, 10, 3069 (2019)
[97] Rubin, DB, Multiple imputation for nonresponse in surveys (1987), New York: Wiley, New York · Zbl 1070.62007
[98] Rubin, DB, Discussion: Statistical disclosure limitation, Journal of Official Statistics, 9, 462-468 (1993)
[99] Ruggles, S.; Fitch, CA; Magnuson, DL; Schroeder, JP, Differential privacy and census data: Implications for social and economic research, AEA Papers and Proceedings, 109, 403-408 (2019)
[100] Shlomo, N.; Skinner, CJ, Privacy protection from sampling and perturbation in survey microdata, Journal of Privacy and Confidentiality, 4, 155-169 (2012)
[101] Shlosser, A., On estimation of the size of the dictionary of a long text on the basis of a sample, Engineering Cybernetics, 19, 97-102 (1981) · Zbl 0507.62007
[102] Singer, E.; Van Hoewyk, J.; Neugebauer, RJ, Attitudes and behavior: the impact of privacy and confidentiality concerns on participation in the 2000 Census, Public Opinion Quarterly, 67, 368-384 (2003)
[103] Singer, E.; Mathiowetz, NA; Couper, MP, The impact of privacy and confidentiality concerns on survey participation: the case of the 1990 U.S. Census, Public Opinion Quarterly, 57, 465-482 (1993)
[104] Smith, A. (2008). Efficient, differentially private point estimators. arXiv:0809.4794.
[105] Snoke, J.; Raab, G.; Nowok, B.; Dibben, C.; Slavkovic, A., General and specific utility measures for synthetic data, Journal of the Royal Statistical Society, Series A, 181, 663-688 (2018)
[106] Solove, DJ, Understanding privacy (2008), Cambridge: Harvard University Press, Cambridge
[107] Solove, DJ, Privacy self-management and the consent dilemma, Harvard Law Review, 126, 1880-1903 (2013)
[108] Soria-Comas, J.; Domingo-Ferrer, J.; Sanchez, D.; Megias, D., Individual differential privacy: A utility-preserving formulation of differential privacy guarantees, IEEE Transactions on Information Forensics and Security, 12, 1418-1429 (2017)
[109] Stewart, KA; Segars, AH, An empirical examination of the concern for information privacy instrument, Information Systems Research, 13, 36-49 (2002)
[110] Sweeney, L. (2000). Uniqueness of Simple Demographics in the U.S. Population, LIDAPWP4. Carnegie Mellon University, Laboratory for International Data Privacy, Pittsburgh.
[111] Sweeney, L., \(k\)-Anonymity: A model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-based Systems, 10, 557-570 (2002) · Zbl 1085.68589
[112] Tang, J., Korolova, A., Bai, X., Wang, X., & Wang, X. (2017). Privacy loss in Apple’s implementation of differential privacy on MacOS 10.12. arXiv:1709.02753 [cs.CR]
[113] Templ, M., Statistical disclosure control for microdata (2017), Cham: Springer, Cham · Zbl 1437.62006
[114] Templ, M.; Meindl, B.; Kowarik, A.; Dupriez, O., Simulation of synthetic complex data: The R package simPop, Journal of Statistical Software, 79, 1-38 (2017)
[115] Tukey, JW, Exploratory data analysis (1977), Boston: Addison-Wesley, Boston · Zbl 0409.62003
[116] Warner, SL, Randomized response: A survey technique for eliminating evasive answer bias, Journal of the American Statistical Association, 60, 63-69 (1965) · Zbl 1298.62024
[117] Warner, SL, The linear randomized response model, Journal of the American Statistical Association, 66, 884-888 (1971)
[118] Wasserman, L.; Zhou, S., A statistical framework for differential privacy, Journal of the American Statistical Association, 105, 375-389 (2010) · Zbl 1364.62011
[119] Wilks, SS, The large-sample distribution of the likelihood ratio for testing composite hypotheses, The Annals of Mathematical Statistics, 9, 60-62 (1938) · Zbl 0018.32003
[120] Willenborg, L.; de Waal, T., Statistical disclosure control in practice, lecture notes in statistics (1996), New York: Springer, New York · Zbl 0853.62096
[121] Willenborg, L.; de Waal, T., Elements of statistical disclosure control. Lecture notes in statistics (2000), New York: Springer, New York · Zbl 0853.62096
[122] Zhu, T.; Li, G.; Zhou, W.; Yu, PS, Differential privacy and applications (2017), Cham: Springer, Cham
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.