Assessing selection bias in regression coefficients estimated from nonprobability samples with applications to genetics and demographic surveys. (English) Zbl 1478.62345

Summary: Selection bias is a serious potential problem for inference about relationships of scientific interest based on samples without well-defined probability sampling mechanisms. Motivated by the potential for selection bias in: (a) estimated relationships of polygenic scores (PGSs) with phenotypes in genetic studies of volunteers and (b) estimated differences in subgroup means in surveys of smartphone users, we derive novel measures of selection bias for estimates of the coefficients in linear and probit regression models fitted to nonprobability samples, when aggregate-level auxiliary data are available for the selected sample and the target population. The measures arise from normal pattern-mixture models that allow analysts to examine the sensitivity of their inferences to assumptions about nonignorable selection in these samples. We examine the effectiveness of the proposed measures in a simulation study and then use them to quantify the selection bias in: (a) estimated PGS-phenotype relationships in a large study of volunteers recruited via Facebook and (b) estimated subgroup differences in mean past-year employment duration in a nonprobability sample of low-educated smartphone users. We evaluate the performance of the measures in these applications using benchmark estimates from large probability samples.
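The problem the summary describes can be illustrated with a small simulation. The sketch below is not the authors' pattern-mixture measure. It only shows, under assumed values, how a regression slope fitted to a selected sample drifts from the population slope when selection depends on the outcome (nonignorable), and stays close to it when selection depends only on the covariate. The function names, the true slope of 0.5, the selection rules, and the sample size are all hypothetical choices made for illustration.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on x for a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate(n=40_000, beta=0.5, seed=1):
    """Return slopes for the population and two hypothetical selected samples.

    Assumed data-generating model (illustrative only):
        x ~ N(0, 1),  y = beta * x + N(0, 1) noise.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [beta * xi + rng.gauss(0, 1) for xi in x]

    # Nonignorable selection: inclusion depends on the outcome y itself.
    keep_on_y = [yi + rng.gauss(0, 1) > 0 for yi in y]
    # Ignorable selection: inclusion depends only on the covariate x.
    keep_on_x = [xi + rng.gauss(0, 1) > 0 for xi in x]

    def subset(values, keep):
        return [v for v, k in zip(values, keep) if k]

    return {
        "population": ols_slope(x, y),
        "selected_on_y": ols_slope(subset(x, keep_on_y), subset(y, keep_on_y)),
        "selected_on_x": ols_slope(subset(x, keep_on_x), subset(y, keep_on_x)),
    }


if __name__ == "__main__":
    for label, slope in simulate().items():
        print(f"{label:>14}: {slope:.3f}")
```

In this setup the slope from the outcome-dependent sample is attenuated toward zero, which is the kind of bias that sensitivity measures of the sort described in the summary are meant to quantify when only aggregate auxiliary data are available for the target population.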


62P10 Applications of statistics to biology and medical sciences; meta analysis
62J05 Linear regression; mixed models
62F07 Statistical ranking and selection procedures
92D10 Genetics and epigenetics

