## Latent demographic profile estimation in hard-to-reach groups.
*(English)*
Zbl 1257.62122

Summary: The sampling frame in most social science surveys excludes members of certain groups, known as hard-to-reach groups. These groups, or subpopulations, may be difficult to access (the homeless, e.g.), camouflaged by stigma (individuals with HIV/AIDS), or both (commercial sex workers). Even basic demographic information about these groups is typically unknown, especially in many developing nations. We present statistical models which leverage social network structures to estimate demographic characteristics of these subpopulations using aggregated relational data (ARD), or questions of the form “How many X’s do you know?” Unlike other network-based techniques for reaching these groups, ARD require no special sampling strategy and are easily incorporated into standard surveys. ARD also do not require respondents to reveal their own group membership.

We propose a Bayesian hierarchical model for estimating the demographic characteristics of hard-to-reach groups, or latent demographic profiles, using ARD, together with two estimation techniques. The first is a Markov chain Monte Carlo algorithm for existing data or for cases where the full posterior distribution is of interest. For cases when new data can be collected, we give guidelines and, based on these guidelines, a simple estimate motivated by a missing data approach. Using data of C. McCarty et al. [Comparing two methods for estimating network size. Human Organization 60, 28–39 (2001)], we estimate the age and gender profiles of six hard-to-reach groups, such as individuals who have HIV, women who were raped, and homeless persons. We also evaluate our simple estimates using simulation studies.
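To make the ARD format concrete, the following is a minimal sketch of the classical network scale-up estimator from the earlier literature (in the spirit of Killworth et al. (1998) in the reference list). It is not the Bayesian hierarchical model or the simple estimate summarized above, and it estimates only the size of a group, not its demographic profile. The group names, counts, and population total are invented for illustration.

```python
# Classical network scale-up sketch on aggregated relational data (ARD).
# All numbers and group names below are hypothetical illustration data.

def scale_up_degrees(ard, known_sizes, population):
    """Estimate each respondent's network size (degree).

    ard: list of dicts mapping group name -> "How many X's do you know?" count
    known_sizes: dict mapping group name -> known size of that group
    population: total population size
    """
    total_known = sum(known_sizes.values())
    return [
        population * sum(row[g] for g in known_sizes) / total_known
        for row in ard
    ]


def scale_up_size(ard, known_sizes, population, hidden):
    """Estimate the size of the group `hidden` whose size is unknown."""
    degrees = scale_up_degrees(ard, known_sizes, population)
    hidden_contacts = sum(row[hidden] for row in ard)
    return population * hidden_contacts / sum(degrees)


population = 1_000_000
known_sizes = {"named_michael": 20_000, "postal_workers": 5_000}
ard = [
    {"named_michael": 4, "postal_workers": 1, "homeless": 1},
    {"named_michael": 8, "postal_workers": 2, "homeless": 0},
    {"named_michael": 7, "postal_workers": 3, "homeless": 2},
]

print(scale_up_degrees(ard, known_sizes, population))
print(scale_up_size(ard, known_sizes, population, "homeless"))
```

Respondents report only counts of acquaintances in each group, never their own membership, which is why such questions fit into a standard survey.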

### MSC:

62P25 | Applications of statistics to social sciences |

62D05 | Sampling theory, sample surveys |

62F15 | Bayesian inference |

91D30 | Social networks; opinion dynamics |

65C40 | Numerical analysis or methods applied to Markov chains |

65C60 | Computational problems in statistics (MSC2010) |

### Keywords:

aggregated relational data; hard-to-reach populations; hierarchical models; social networks; survey design

### Software:

SDaA

*T. H. McCormick* and *T. Zheng*, Ann. Appl. Stat. 6, No. 4, 1795–1813 (2012; Zbl 1257.62122)

### References:

[1] | Bernard, H. R., Johnsen, E. C., Killworth, P. D. and Robinson, S. (1991). Estimating the size of an average personal network and of an event subpopulation: Some empirical results. Social Science Research 20 109-121. |

[2] | Centers for Disease Control (2011). WISQARS leading causes of death reports. |

[3] | Centers for Disease Control and Prevention (2011). Centers for Disease Control and Prevention, National Center for Injury Prevention and Control. Web-based Injury Statistics Query and Reporting System (WISQARS). |

[4] | Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol. 39 1-38. · Zbl 0364.62022 |

[5] | DiPrete, T. A., Gelman, A., McCormick, T. H., Teitler, J. and Zheng, T. (2011). Segregation in social networks based on acquaintanceship and trust. American Journal of Sociology 116 1234-1283. |

[6] | Goel, S. and Salganik, M. J. (2009). Respondent-driven sampling as Markov chain Monte Carlo. Stat. Med. 28 2202-2229. · doi:10.1002/sim.3613 |

[7] | Heckathorn, D. D. (1997). Respondent-driven sampling: A new approach to the study of hidden populations. Social Problems 44 174-199. |

[8] | Heckathorn, D. D. (2002). Respondent-driven sampling II: Deriving valid population estimates from chain-referral samples of hidden populations. Social Problems 49 11-34. |

[9] | Hoff, P. D. (2005). Bilinear mixed-effects models for dyadic data. J. Amer. Statist. Assoc. 100 286-295. · Zbl 1117.62353 · doi:10.1198/016214504000001015 |

[10] | Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002). Latent space approaches to social network analysis. J. Amer. Statist. Assoc. 97 1090-1098. · Zbl 1041.62098 · doi:10.1198/016214502388618906 |

[11] | Killworth, P. D., Johnsen, E. C., Bernard, H. R., Shelley, G. A. and McCarty, C. (1990). Estimating the size of personal networks. Social Networks 12 289-312. |

[12] | Killworth, P. D., McCarty, C., Bernard, H. R., Shelley, G. A. and Johnsen, E. C. (1998a). Estimation of seroprevalence, rape, and homelessness in the U.S. using a social network approach. Evaluation Review 22 289-308. |

[13] | Killworth, P. D., Johnsen, E. C., McCarty, C., Shelley, G. A. and Bernard, H. R. (1998b). A social network approach to estimating seroprevalence in the United States. Social Networks 20 23-50. |

[14] | Killworth, P. D., McCarty, C., Bernard, H. R., Johnsen, E. C., Domini, J. and Shelley, G. A. (2003). Two interpretations of reports of knowledge of subpopulation sizes. Social Networks 25 141-160. |

[15] | Lavallée, P. (2007). Indirect Sampling. Springer, New York. · Zbl 1183.62015 |

[16] | Lawson, C. L. and Hanson, R. J. (1974). Solving Least Squares Problems. Prentice Hall International, Englewood Cliffs, NJ. · Zbl 0860.65028 |

[17] | Lohr, S. L. (1999). Sampling: Design and Analysis. Duxbury Press, Belmont, CA. · Zbl 0967.62005 |

[18] | McCarty, C., Killworth, P. D., Bernard, H. R., Johnsen, E. and Shelley, G. A. (2001). Comparing two methods for estimating network size. Human Organization 60 28-39. |

[19] | McCormick, T. H., Salganik, M. J. and Zheng, T. (2010). How many people do you know? Efficiently estimating personal network size. J. Amer. Statist. Assoc. 105 59-70. · Zbl 1397.62051 · doi:10.1198/jasa.2009.ap08518 |

[20] | McCormick, T. H. and Zheng, T. (2007). Adjusting for recall bias in “How Many X’s Do You Know?” surveys. In Proceedings of the Joint Statistical Meetings. Salt Lake City, UT. |

[21] | McCormick, T. H., Moussa, A., Ruf, J., DiPrete, T. A., Gelman, A., Teitler, J. and Zheng, T. (2009). Comparing two methods for predicting opinions using social structure. In Proceedings of the Joint Statistical Meetings. Washington, DC. |

[22] | Federal Bureau of Investigation (1999). Crime in the United States. |

[23] | Office of Advocacy, U.S. Small Business Administration (1997). Characteristics of small business employees and owners. |

[24] | Salganik, M. J., Mello, M. B., Abdo, A. H., Bertoni, N., Fazito, D. and Bastos, F. I. (2011). The game of contacts: Estimating the social visibility of groups. Social Networks 33 70-78. |

[25] | Shelley, G., Bernard, H., Killworth, P., Johnsen, E. and McCarty, C. (1995). Who knows your HIV status? What HIV+ patients and their network members know about each other. Social Networks 17 189-217. |

[26] | UNAIDS (2003). Estimating the size of populations at risk for HIV 03.36E. UNAIDS, Geneva. |

[27] | Zheng, T., Salganik, M. J. and Gelman, A. (2006). How many people do you know in prison?: Using overdispersion in count data to estimate social structure in networks. J. Amer. Statist. Assoc. 101 409-423. · Zbl 1119.62388 · doi:10.1198/016214505000001168 |
