**Asymptotic seed bias in respondent-driven sampling.**
*(English)*
Zbl 1439.62061

Summary: Respondent-driven sampling (RDS) collects a sample of individuals in a networked population by incentivizing the sampled individuals to refer their contacts into the sample. This iterative process is initialized from some seed node(s). Depending on which seeds are selected, the resulting seed bias can be large or small. This paper gains a deeper understanding of this bias by characterizing its effect on the limiting distribution of various RDS estimators. Using classical tools and results from multi-type branching processes [H. Kesten and B. P. Stigum, Ann. Math. Stat. 37, 1463–1481 (1966; Zbl 0203.17402)], we show that the seed bias is negligible for the Generalized Least Squares (GLS) estimator and non-negligible for both the inverse probability weighted and Volz-Heckathorn (VH) estimators. In particular, we show that (i) above a critical threshold, VH converges to a non-trivial mixture distribution, where the mixture component depends on the seed node, and the mixture distribution is possibly multi-modal. Moreover, (ii) GLS converges to a Gaussian distribution independent of the seed node, under a certain condition on the Markov process. Numerical experiments with both simulated data and empirical social networks suggest that these results hold beyond the Markov conditions of the theorems.
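As a rough illustration of the seed dependence described in the summary, the following Python sketch simulates a referral tree on a small two-community random graph and applies the Volz-Heckathorn estimator (the inverse-degree-weighted sample mean) to the share of nodes in the first community. It is not taken from the paper: the graph model, the with-replacement referral rule, all parameter values, and the function names `vh_estimate`, `two_block_graph`, and `referral_tree_sample` are assumptions made for this example.

```python
import random


def vh_estimate(values, degrees):
    """Volz-Heckathorn estimate: inverse-degree-weighted mean of sampled values."""
    weights = [1.0 / d for d in degrees]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def two_block_graph(n_per_block, p_in, p_out, rng):
    """Undirected two-community random graph as an adjacency list (assumed model)."""
    n = 2 * n_per_block
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            same_block = (i < n_per_block) == (j < n_per_block)
            if rng.random() < (p_in if same_block else p_out):
                adj[i].append(j)
                adj[j].append(i)
    return adj


def referral_tree_sample(adj, seed, referrals, waves, rng):
    """Grow a referral tree: each participant refers `referrals` neighbours,
    drawn uniformly with replacement (a simplifying assumption)."""
    sample, frontier = [seed], [seed]
    for _ in range(waves):
        next_frontier = []
        for node in frontier:
            if adj[node]:
                next_frontier.extend(rng.choice(adj[node]) for _ in range(referrals))
        sample.extend(next_frontier)
        frontier = next_frontier
    return sample


if __name__ == "__main__":
    rng = random.Random(1)
    n_per_block = 50
    adj = two_block_graph(n_per_block, p_in=0.2, p_out=0.01, rng=rng)

    # Compare seeds taken from each community; the true share of block 0 is 0.5.
    for label, block in (("block 0", range(0, n_per_block)),
                         ("block 1", range(n_per_block, 2 * n_per_block))):
        seed = next(v for v in block if adj[v])  # skip isolated nodes
        estimates = []
        for _ in range(200):
            sample = referral_tree_sample(adj, seed, referrals=2, waves=5, rng=rng)
            y = [1.0 if v < n_per_block else 0.0 for v in sample]
            d = [len(adj[v]) for v in sample]
            estimates.append(vh_estimate(y, d))
        print(label, "seed: mean VH estimate =", sum(estimates) / len(estimates))
```

With strong community structure (`p_in` much larger than `p_out`) the averaged estimates would be expected to differ between the two seeds, which is the kind of seed bias the summary refers to. The sketch makes no attempt to reproduce the paper's theorems or its GLS estimator.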

### MSC:

62D05 | Sampling theory, sample surveys |

60J20 | Applications of Markov chains and discrete-time Markov processes on general state spaces (social mobility, learning theory, industrial processes, etc.) |

60J80 | Branching processes (Galton-Watson, birth-and-death, etc.) |

62E20 | Asymptotic distribution theory in statistics |

60F17 | Functional limit theorems; invariance principles |

### Citations:

Zbl 0203.17402

### References:

[1] | Athreya, K. B. and Ney, P. E. (2004). Branching processes. Courier Corporation. |

[2] | Baraff, A. J., McCormick, T. H. and Raftery, A. E. (2016). Estimating uncertainty in respondent-driven sampling using a tree bootstrap method. Proceedings of the National Academy of Sciences 201617258. |

[3] | Benjamini, I. and Peres, Y. (1994). Markov chains indexed by trees. The Annals of Probability 219-243. · Zbl 0793.60080 · doi:10.1214/aop/1176988857 |

[4] | CDC (2017). National HIV Behavioral Surveillance (NHBS). Division of HIV/AIDS Prevention. |

[5] | Durrett, R. (2019). Probability: theory and examples 49. Cambridge University Press. · Zbl 1440.60001 |

[6] | Goel, S. and Salganik, M. J. (2009). Respondent-driven sampling as Markov chain Monte Carlo. Statistics in Medicine 28 2202-2229. |

[7] | Harris, K. M. (2011). The national longitudinal study of adolescent health: Research design. http://www.cpc.unc.edu/projects/addhealth/design. |

[8] | Harris, T. E. (2002). The theory of branching processes. Courier Corporation. · Zbl 1037.60001 |

[9] | Heckathorn, D. D. (1997). Respondent-driven sampling: a new approach to the study of hidden populations. Social Problems 44 174-199. |

[10] | Holland, P. W. and Laskey, K. B. (1983). Stochastic blockmodels: First steps. Social Networks 5 109-137. |

[11] | Johnston, L. (2013). Introduction to HIV/AIDS and sexually transmitted infection surveillance: Module 4: Introduction to Respondent Driven Sampling. World Health Organization. |

[12] | Kesten, H. and Stigum, B. P. (1966). Additional limit theorems for indecomposable multidimensional Galton-Watson processes. The Annals of Mathematical Statistics 37 1463-1481. · Zbl 0203.17402 · doi:10.1214/aoms/1177699139 |

[13] | Levin, D. A., Peres, Y. and Wilmer, E. L. (2009). Markov chains and mixing times. American Mathematical Soc. · Zbl 1160.60001 |

[14] | Li, X. and Rohe, K. (2017). Central limit theorems for network driven sampling. Electronic Journal of Statistics 11 4871-4895. · Zbl 1386.60144 · doi:10.1214/17-EJS1333 |

[15] | Roch, S. and Rohe, K. (2018). Generalized least squares can overcome the critical threshold in respondent-driven sampling. Proceedings of the National Academy of Sciences 115 10299-10304. · Zbl 1416.62091 · doi:10.1073/pnas.1706699115 |

[16] | Rohe, K. (2019). A critical threshold for design effects in network sampling. The Annals of Statistics 47 556-582. · Zbl 1417.62021 · doi:10.1214/18-AOS1700 |

[17] | Salganik, M. J. and Heckathorn, D. D. (2004). Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology 34 193-240. |

[18] | Volz, E. and Heckathorn, D. D. (2008). Probability based estimation theory for respondent driven sampling. Journal of Official Statistics 24 79. |

[19] | White, H. C., Boorman, S. A. and Breiger, R. L. (1976). Social Structure from Multiple Networks. I. Blockmodels of Roles and Positions. American Journal of Sociology 81 730-780. |

[20] | White, R. |
