Data dissemination and disclosure limitation in a world without microdata: a risk-utility framework for remote access analysis servers. (English) Zbl 1088.62142

Summary: Given the public’s ever-increasing concerns about data confidentiality, in the near future statistical agencies may be unable or unwilling, or even may not be legally allowed, to release any genuine microdata – data on individual units, such as individuals or establishments. In such a world, an alternative dissemination strategy is remote access analysis servers, to which users submit requests for output from statistical models fit using the data, but are not allowed access to the data themselves. Analysis servers, however, are not free from the risk of disclosure, especially in the face of multiple, interacting queries. We describe these risks and propose quantifiable measures of risk and data utility that can be used to specify which queries can be answered and with what output. The risk-utility framework is illustrated for regression models.


62P99 Applications of statistics
62J05 Linear regression; mixed models
68U99 Computing methodologies and applications

