A practical approach to adjusting for population stratification in genome-wide association studies: principal components and propensity scores (PCAPS). (English) Zbl 1420.92080

Summary: Genome-wide association studies (GWAS) are susceptible to bias due to population stratification (PS). The most widely used method to correct bias due to PS is principal components (PCs) analysis (PCA), but there is no objective method to guide which PCs to include as covariates. Often, the ten PCs with the highest eigenvalues are included to adjust for PS. This selection is arbitrary, and patterns of local linkage disequilibrium may affect PCA corrections. To address these limitations, we estimate genomic propensity scores based on all statistically significant PCs selected by the Tracy-Widom (TW) statistic. We compare a principal components and propensity scores (PCAPS) approach to PCA and EMMAX using simulated GWAS data under no, moderate, and severe PS. PCAPS reduced spurious genetic associations regardless of the degree of PS, resulting in odds ratio (OR) estimates closer to the true OR. We illustrate our PCAPS method using GWAS data from a study of testicular germ cell tumors. PCAPS provided a more conservative adjustment than PCA. Advantages of the PCAPS approach include reduction of bias compared to PCA, consistent selection of propensity scores to adjust for PS, the potential ability to handle outliers, and ease of implementation using existing software packages.
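The adjustment pipeline described in the summary can be sketched roughly as follows. This is a minimal illustration on simulated data, not the authors' implementation. It makes several assumptions that the summary does not spell out: the propensity score is taken to be the probability of carrying the variant at the test SNP given the selected PCs; the number of retained PCs `k` is fixed by hand as a stand-in for the Tracy-Widom selection step, which is not implemented here; and all names (`top_pcs`, `fit_logistic`, the simulation parameters) are hypothetical.

```python
import numpy as np

def sigmoid(eta):
    # Clip the linear predictor to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(eta, -30.0, 30.0)))

def top_pcs(G, k):
    """Return the first k principal component scores of genotype matrix G (n x m)."""
    sd = G.std(axis=0)
    keep = sd > 0                                  # drop monomorphic SNPs
    Z = (G[:, keep] - G[:, keep].mean(axis=0)) / sd[keep]
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] * S[:k]

def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Logistic regression by Newton-Raphson; X must contain an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        W = p * (1.0 - p)
        H = X.T @ (X * W[:, None]) + ridge * np.eye(X.shape[1])
        step = np.linalg.solve(H, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta

# --- Simulated data with two hidden subpopulations (assumed setup) ---
rng = np.random.default_rng(0)
n, m = 1000, 200
pop = rng.integers(0, 2, size=n)                   # hidden ancestry label
f0 = rng.uniform(0.1, 0.5, m)                      # allele frequencies, population 0
f1 = np.clip(f0 + rng.normal(0, 0.15, m), 0.05, 0.95)
G = rng.binomial(2, np.where(pop[:, None] == 1, f1, f0))

# Test SNP and disease status both depend on ancestry; the true SNP log OR is 0.
carrier = (rng.binomial(2, np.where(pop == 1, 0.5, 0.2)) > 0).astype(float)
y = rng.binomial(1, np.where(pop == 1, 0.6, 0.3)).astype(float)

# Step 1: PCs. k is a placeholder for the count a Tracy-Widom test would select.
k = 2
pcs = top_pcs(G, k)

# Step 2: propensity score = P(carrier | selected PCs), from a logistic model.
X_ps = np.column_stack([np.ones(n), pcs])
ps = sigmoid(X_ps @ fit_logistic(X_ps, carrier))

# Step 3: outcome model with and without the propensity score as a covariate.
X_crude = np.column_stack([np.ones(n), carrier])
X_adj = np.column_stack([np.ones(n), carrier, ps])
log_or_crude = fit_logistic(X_crude, y)[1]
log_or_adj = fit_logistic(X_adj, y)[1]
print("crude log OR:", log_or_crude, "PS-adjusted log OR:", log_or_adj)
```

Collapsing the selected PCs into a single score keeps the outcome model small however many PCs are retained, which is one plausible reading of the "ease of implementation" advantage mentioned in the summary.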


92D10 Genetics and epigenetics
62P10 Applications of statistics to biology and medical sciences; meta analysis
62H25 Factor analysis and principal components; correspondence analysis

