Zhou, Baiyu; Whittemore, Alice S.
Improving sequence-based genotype calls with linkage disequilibrium and pedigree information. (English) Zbl 1243.62138
Ann. Appl. Stat. 6, No. 2, 457-475 (2012).

Summary: Whole and targeted sequencing of human genomes is a promising, increasingly feasible tool for discovering genetic contributions to risk of complex diseases. A key step is calling an individual’s genotype from the multiple aligned short read sequences of his DNA, each of which is subject to nucleotide read error. Current methods are designed to call genotypes separately at each locus from the sequence data of unrelated individuals. Here we propose likelihood-based methods that improve calling accuracy by exploiting two features of sequence data. The first is the linkage disequilibrium (LD) between nearby single nucleotide polymorphisms (SNPs). The second is the Mendelian pedigree information available when related individuals are sequenced. In both cases the likelihood involves the probabilities of read variant counts given genotypes, summed over the unobserved genotypes. Parameters governing the prior genotype distribution and the read error rates can be estimated either from the sequence data itself or from external reference data. We use simulations and synthetic read data based on the 1000 Genomes Project to evaluate the performance of the proposed methods. An R-program to apply the methods to small families is freely available at http://med.stanford.edu/epidemiology/PHGC/.

MSC:
62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology
92D10 Genetics and epigenetics
65C60 Computational problems in statistics (MSC2010)
62F10 Point estimation

Keywords: human genome sequencing
Software: SeqEM
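
The summary describes a likelihood built from the probabilities of read variant counts given genotypes, summed over the unobserved genotypes, with Mendelian pedigree information as one source of improvement. The Python sketch below illustrates that idea for a single biallelic SNP in a father-mother-child trio. It is not the authors' R program and does not cover the LD-based method. The binomial read model with one symmetric error rate `err`, the Hardy-Weinberg prior with variant allele frequency `q`, and all function names are assumptions made for illustration only.

```python
from itertools import product
from math import comb

GENOTYPES = (0, 1, 2)  # number of copies of the variant allele


def read_likelihood(k, n, g, err):
    """P(k variant reads among n reads | genotype g).

    Assumed model: binomial reads with a symmetric read error rate err.
    """
    p = (err, 0.5, 1.0 - err)[g]
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def hwe_prior(q):
    """Assumed Hardy-Weinberg genotype prior for variant allele frequency q."""
    return ((1 - q) ** 2, 2 * q * (1 - q), q**2)


def transmission(gc, gf, gm):
    """P(child genotype gc | father gf, mother gm) under Mendelian inheritance."""
    pf = gf / 2.0  # probability that the father transmits the variant allele
    pm = gm / 2.0  # probability that the mother transmits the variant allele
    return (
        (1 - pf) * (1 - pm),
        pf * (1 - pm) + (1 - pf) * pm,
        pf * pm,
    )[gc]


def call_single(k, n, q, err):
    """Genotype posterior for one individual treated as unrelated."""
    prior = hwe_prior(q)
    weights = [prior[g] * read_likelihood(k, n, g, err) for g in GENOTYPES]
    total = sum(weights)
    return [w / total for w in weights]


def call_trio_child(reads, q, err):
    """Posterior of the child's genotype, summing over parental genotypes.

    reads = ((k_father, n_father), (k_mother, n_mother), (k_child, n_child))
    """
    (kf, nf), (km, nm), (kc, nc) = reads
    prior = hwe_prior(q)
    weights = [0.0, 0.0, 0.0]
    for gf, gm, gc in product(GENOTYPES, repeat=3):
        weights[gc] += (
            prior[gf]
            * prior[gm]
            * transmission(gc, gf, gm)
            * read_likelihood(kf, nf, gf, err)
            * read_likelihood(km, nm, gm, err)
            * read_likelihood(kc, nc, gc, err)
        )
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Child has 1 variant read out of 4; both parents have 0 out of 20.
    print(call_single(1, 4, q=0.1, err=0.01))
    print(call_trio_child(((0, 20), (0, 20), (1, 4)), q=0.1, err=0.01))
```

With these illustrative inputs, the single-individual call favors the heterozygous genotype, while the trio call favors the homozygous reference genotype: well-covered parents with no variant reads make a transmitted variant allele implausible, so the child's lone variant read is better explained as a read error.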