Simulation of close-to-reality population data for household surveys with application to EU-SILC. (English) Zbl 1237.91178
Summary: Statistical simulation in survey statistics is usually based on repeatedly drawing samples from population data. Furthermore, population data may be used in courses on survey statistics to explain issues regarding, e.g., sampling designs. Since the availability of real population data is in general very limited, it is necessary to generate synthetic data for such applications. The simulated data need to be as realistic as possible, while at the same time ensuring data confidentiality. This paper proposes a method for generating close-to-reality population data for complex household surveys. The procedure consists of four steps for setting up the household structure, simulating categorical variables, simulating continuous variables and splitting continuous variables into different components. It is not required to perform all four steps so that the framework is applicable to a broad class of surveys. In addition, the proposed method is evaluated in an application to the European Union Statistics on Income and Living Conditions (EU-SILC).

91B82 Statistical methods; economic indices and measures
62P20 Applications of statistics to economics
[1] Alfons A (2010) simFrame: simulation framework. R package version 0.3.7
[2] Alfons A, Kraft S (2010) simPopulation: simulation of synthetic populations for surveys based on sample data. R package version 0.2.1
[3] Alfons A, Templ M, Filzmoser P (2010a) An object-oriented framework for statistical simulation: the R package simFrame. J Stat Softw 37(3): 1–36
[4] Alfons A, Templ M, Filzmoser P (2010b) Simulation of EU-SILC population data: using the R package simPopulation. Research Report CS-2010-5, Department of Statistics and Probability Theory, Vienna University of Technology · Zbl 1237.91178
[5] Atkinson T, Cantillon B, Marlier E, Nolan B (2002) Social indicators: the EU and social inclusion. Oxford University Press, New York ISBN 0-19-925349-8
[6] Clarke G (1996) Microsimulation: an introduction. In: Clarke G (ed) Microsimulation for urban and regional policy analysis. Pion, London
[7] Drechsler J, Bender S, Rässler S (2008) Comparing fully and partially synthetic datasets for statistical disclosure control in the German IAB Establishment Panel. Trans Data Priv 1(3): 105–130
[8] Embrechts P, Klüppelberg C, Mikosch T (1997) Modelling extremal events for insurance and finance. Springer, New York ISBN 3-540-60931-8 · Zbl 0873.62116
[9] Eurostat (2004) Description of target variables: cross-sectional and longitudinal. EU-SILC 065/04, Eurostat, Luxembourg
[10] Horvitz D, Thompson D (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47(260): 663–685 · Zbl 0047.38301
[11] Kendall M, Stuart A (1967) The advanced theory of statistics, vol 2, 2nd edn. Charles Griffin & Co. Ltd, London · Zbl 0416.62001
[12] Kleiber C, Kotz S (2003) Statistical size distributions in economics and actuarial sciences. Wiley, Hoboken ISBN 0-471-15064-9 · Zbl 1044.62014
[13] Kraft S (2009) Simulation of a population for the European living and income conditions survey. Master’s thesis, Vienna University of Technology
[14] Meyer D, Zeileis A, Hornik K (2006) The strucplot framework: visualizing multi-way contingency tables with vcd. J Stat Softw 17(3): 1–48
[15] Meyer D, Zeileis A, Hornik K (2010) vcd: visualizing categorical data. R package version 1.2-9
[16] Münnich R, Schürle J (2003) On the simulation of complex universes in the case of applying the German Microcensus. DACSEIS research paper series No. 4, University of Tübingen
[17] Münnich R, Schürle J, Bihler W, Boonstra HJ, Knotterus P, Nieuwenbroek N, Haslinger A, Laaksonen S, Eckmair D, Quatember A, Wagner H, Renfer JP, Oetliker U, Wiegert R (2003) Monte Carlo simulation study of European surveys. DACSEIS Deliverables D3.1 and D3.2, University of Tübingen
[18] Raghunathan T, Reiter J, Rubin D (2003) Multiple imputation for statistical disclosure limitation. J Off Stat 19(1): 1–16
[19] R Development Core Team (2010) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0
[20] Reiter J (2009) Using multiple imputation to integrate and disseminate confidential microdata. Int Stat Rev 77(2): 179–195
[21] Rubin D (1993) Discussion: statistical disclosure limitation. J Off Stat 9(2): 461–468
[22] Sarkar D (2008) Lattice: multivariate data visualization with R. Springer, New York ISBN 978-0-387-75968-5 · Zbl 1166.62003
[23] Sarkar D (2011) lattice: lattice graphics. R package version 0.19-17
[24] Simonoff J (2003) Analyzing categorical data. Springer, New York ISBN 0-387-00749-0 · Zbl 1028.62003
[25] Templ M, Alfons A (2010) Disclosure risk of synthetic population data with application in the case of EU-SILC. In: Domingo-Ferrer J, Magkos E (eds) Privacy in statistical databases. Lecture notes in computer science, vol 6344. Springer, Heidelberg, pp 174–186
[26] Walker A (1977) An efficient method for generating discrete random variables with general distributions. ACM Trans Math Softw 3(3): 253–256 · Zbl 1148.65301
[27] Weisberg S (2005) Applied linear regression, 3rd edn. Wiley, Hoboken ISBN 0-471-66379-4 · Zbl 1068.62077
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.