zbMATH — the first resource for mathematics

Modeling read counts for CNV detection in exome sequencing data. (English) Zbl 1296.92049
Summary: Varying depth of high-throughput sequencing reads along a chromosome makes it possible to observe copy number variants (CNVs) in a sample relative to a reference. In exome and other targeted sequencing projects, technical factors increase variation in read depth while reducing the number of observed locations, adding difficulty to the problem of identifying CNVs. We present a hidden Markov model for detecting CNVs from raw read count data, using background read depth from a control set as well as other positional covariates such as GC-content. The model, exomeCopy, is applied to a large chromosome X exome sequencing project identifying a list of large unique CNVs. CNVs predicted by the model and experimentally validated are then recovered using a cross-platform control set from publicly available exome sequencing data. Simulations show high sensitivity for detecting heterozygous and homozygous CNVs, outperforming normalization and state-of-the-art segmentation methods.
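As a rough illustration of the idea in the summary (not the exomeCopy implementation), the sketch below decodes copy number states from read counts with a toy hidden Markov model. Everything in it is an assumption made for illustration: the Poisson emission, the fixed transition probabilities, the function names, and the example numbers. The expected count in each window is taken to be the background depth scaled by copy number relative to the normal state, and positional covariates such as GC-content are left out.

```python
import math


def poisson_logpmf(k, mu):
    """Log probability of observing count k under a Poisson with mean mu > 0."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def viterbi_copy_number(counts, background, states=(0, 1, 2, 3, 4),
                        normal=2, stay=0.99, eps=0.01):
    """Most likely copy number per window under a toy Poisson HMM.

    counts     : observed read counts per window (non-negative integers)
    background : expected read depth per window at normal copy number
    states     : candidate copy numbers (hypothetical choice)
    stay       : probability of remaining in the same state (hypothetical)
    eps        : small depth fraction so that copy number 0 has a positive mean
    """
    n = len(states)
    switch = (1.0 - stay) / (n - 1)
    log_trans = [[math.log(stay if i == j else switch) for j in range(n)]
                 for i in range(n)]

    def emit(t, s):
        # Expected depth scales with copy number relative to the normal state.
        mu = background[t] * max(states[s] / normal, eps)
        return poisson_logpmf(counts[t], mu)

    # Uniform initial distribution over states.
    score = [math.log(1.0 / n) + emit(0, s) for s in range(n)]
    back = []
    for t in range(1, len(counts)):
        new_score, pointers = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][j] + emit(t, j))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[s] for s in path]


# Made-up data: four windows at roughly half the background depth.
background = [100] * 12
counts = [100, 98, 103, 99, 51, 49, 52, 50, 101, 97, 102, 100]
calls = viterbi_copy_number(counts, background)
print(calls)
```

On this made-up input the four low-depth windows are called as copy number 1 and the rest as copy number 2. A count model with extra dispersion and a background that depends on covariates would be the natural next refinements.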
MSC:
92B15 General biostatistics
92D10 Genetics and epigenetics
References:
[1] 1000 Genomes Project Consortium (2010): “A map of human genome variation from population-scale sequencing,” Nature, 467, 1061-1073.
[2] Alkan, C., J. M. Kidd, T. Marques-Bonet, G. Aksay, F. Antonacci, F. Hormozdiari, J. O. Kitzman, C. Baker, M. Malig, O. Mutlu, S. C. Sahinalp, R. A. Gibbs, and E. E. Eichler (2009): “Personalized copy number and segmental duplication maps using next-generation sequencing,” Nature Genetics, 41, 1061-1067.
[3] Anders, S. and W. Huber (2010): “Differential expression analysis for sequence count data,” Genome Biology, 11, R106+.
[4] Benjamini, Y. and T. P. Speed (2011): “Estimation and correction for GC-content bias in high throughput sequencing,” Technical report, University of California at Berkeley.
[5] Bliss, C. I. and R. A. Fisher (1953): “Fitting the Negative Binomial Distribution to Biological Data,” Biometrics, 9.
[6] Boeva, V., A. Zinovyev, K. Bleakley, J.-P. Vert, I. Janoueix-Lerosey, O. Delattre, and E. Barillot (2011): “Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization,” Bioinformatics, 27, 268-269.
[7] Campbell, P. J., P. J. Stephens, E. D. Pleasance, S. O’Meara, H. Li, T. Santarius, L. A. Stebbings, C. Leroy, S. Edkins, C. Hardy, J. W. Teague, A. Menzies, I. Goodhead, D. J. Turner, C. M. Clee, M. A. Quail, A. Cox, C. Brown, R. Durbin, M. E. Hurles, P. A. W. Edwards, G. R. Bignell, M. R. Stratton, and P. A. Futreal (2008): “Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing,” Nature Genetics, 40, 722-729.
[9] Conrad, D. F., D. Pinto, R. Redon, L. Feuk, O. Gokcumen, Y. Zhang, J. Aerts, T. D. Andrews, C. Barnes, P. Campbell, T. Fitzgerald, M. Hu, C. H. Ihm, K. Kristiansson, D. G. MacArthur, J. R. MacDonald, I. Onyiah, A. W. Pang, S. Robson, K. Stirrups, A. Valsesia, K. Walter, J. Wei, C. Tyler-Smith, N. P. Carter, C. Lee, S. W. Scherer, and M. E. Hurles (2010): “Origins and functional impact of copy number variation in the human genome,” Nature, 464, 704-712.
[10] Fridlyand, J. (2004): “Hidden Markov models approach to the analysis of array CGH data,” Journal of Multivariate Analysis, 90, 132-153. · Zbl 1047.92026
[11] Gentleman, R., V. Carey, D. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Yang, and J. Zhang (2004): “Bioconductor: open software development for computational biology and bioinformatics,” Genome Biology, 5, R80+.
[12] Glessner, J. T., K. Wang, G. Cai, O. Korvatska, C. E. Kim, S. Wood, H. Zhang, A. Estes, C. W. Brune, J. P. Bradfield, M. Imielinski, E. C. Frackelton, J. Reichert, E. L. Crawford, J. Munson, P. M. A. Sleiman, R. Chiavacci, K. Annaiah, K. Thomas, C. Hou, W. Glaberson, J. Flory, F. Otieno, M. Garris, L. Soorya, L. Klei, J. Piven, K. J. Meyer, E. Anagnostou, T. Sakurai, R. M. Game, D. S. Rudd, D. Zurawiecki, C. J. McDougle, L. K. Davis, J. Miller, D. J. Posey, S. Michaels, A. Kolevzon, J. M. Silverman, R. Bernier, S. E. Levy, R. T. Schultz, G. Dawson, T. Owley, W. M. McMahon, T. H. Wassink, J. A. Sweeney, J. I. Nurnberger, H. Coon, J. S. Sutcliffe, N. J. Minshew, S. F. A. Grant, M. Bucan, E. H. Cook, J. D. Buxbaum, B. Devlin, G. D. Schellenberg, and H. Hakonarson (2009): “Autism genome-wide copy number variation reveals ubiquitin and neuronal genes,” Nature, 459, 569-573.
[13] Gonzalez, E., H. Kulkarni, H. Bolivar, A. Mangano, R. Sanchez, G. Catano, R. J. Nibbs, B. I. Freedman, M. P. Quinones, M. J. Bamshad, K. K. Murthy, B. H. Rovin, W. Bradley, R. A. Clark, S. A. Anderson, R. J. O’Connell, B. K. Agan, S. S. Ahuja, R. Bologna, L. Sen, M. J. Dolan, and S. K. Ahuja (2005): “The Influence of CCL3L1 Gene-Containing Segmental Duplications on HIV-1/AIDS Susceptibility,” Science, 307, 1434-1440.
[15] Hedges, D. J., T. Guettouche, S. Yang, G. Bademci, A. Diaz, A. Andersen, W. F. Hulme, S. Linker, A. Mehta, Y. J. K. Edwards, G. W. Beecham, E. R. Martin, M. A. Pericak-Vance, S. Zuchner, J. M. Vance, and J. R. Gilbert (2011): “Comparison of Three Targeted Enrichment Strategies on the SOLiD Sequencing Platform,” PLoS ONE, 6, e18595+.
[16] Herman, D. S., G. K. Hovingh, O. Iartchouk, H. L. Rehm, R. Kucherlapati, J. G. Seidman, and C. E. Seidman (2009): “Filter-based hybridization capture of subgenomes enables resequencing and copy-number detection,” Nature Methods, 6, 507-510.
[17] Ivakhno, S., T. Royce, A. J. Cox, D. J. Evers, R. K. Cheetham, and S. Tavaré (2010): “CNAseg - a novel framework for identification of copy number changes in cancer from second-generation sequencing data,” Bioinformatics, 26, 3051-3058.
[18] Kleinjan, D.-J. and V. van Heyningen (1998): “Position Effect in Human Genetic Disease,” Human Molecular Genetics, 7, 1611-1618.
[19] Li, Y., N. Vinckenbosch, G. Tian, E. Huerta-Sanchez, T. Jiang, H. Jiang, A. Albrechtsen, G. Andersen, H. Cao, T. Korneliussen, N. Grarup, Y. Guo, I. Hellman, X. Jin, Q. Li, J. Liu, X. Liu, T. Sparso, M. Tang, H. Wu, R. Wu, C. Yu, H. Zheng, A. Astrup, L. Bolund, J. Holmkvist, T. Jorgensen, K. Kristiansen, O. Schmitz, T. W. Schwartz, X. Zhang, R. Li, H. Yang, J. Wang, T. Hansen, O. Pedersen, R. Nielsen, and J. Wang (2010): “Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants,” Nature Genetics, 42, 969-972.
[20] Madrigal, I., L. Rodríguez-Revenga, L. Armengol, E. González, B. Rodriguez, C. Badenas, A. Sánchez, F. Martínez, M. Guitart, I. Fernández, J. A. Arranz, M. Tejada, L. A. Pérez-Jurado, X. Estivill, and M. Milà (2007): “X-chromosome tiling path array detection of copy number variants in patients with chromosome X-linked mental retardation,” BMC Genomics, 8, 443+.
[21] Marioni, J. C., N. P. Thorne, and S. Tavaré (2006): “BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data,” Bioinformatics, 22, 1144-1146.
[22] Medvedev, P., M. Stanciu, and M. Brudno (2009): “Computational methods for discovering structural variation with next-generation sequencing,” Nature Methods, 6, S13-S20.
[24] Nord, A., M. Lee, M. C. King, and T. Walsh (2011): “Accurate and exact CNV identification from targeted high-throughput sequence data,” BMC Genomics, 12, 184+.
[25] O’Roak, B. J., P. Deriziotis, C. Lee, L. Vives, J. J. Schwartz, S. Girirajan, E. Karakoc, A. P. MacKenzie, S. B. Ng, C. Baker, M. J. Rieder, D. A. Nickerson, R. Bernier, S. E. Fisher, J. Shendure, and E. E. Eichler (2011): “Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations,” Nature Genetics, 43, 585-589.
[26] Pang, A., J. MacDonald, D. Pinto, J. Wei, M. Rafiq, D. Conrad, H. Park, M. Hurles, C. Lee, J. C. Venter, E. Kirkness, S. Levy, L. Feuk, and S. Scherer (2010): “Towards a comprehensive structural variation map of an individual human genome,” Genome Biology, 11, R52+.
[27] Pruitt, K. D., J. Harrow, R. A. Harte, C. Wallin, M. Diekhans, D. R. Maglott, S. Searle, C. M. Farrell, J. E. Loveland, B. J. Ruef, E. Hart, M.-M. M. Suner, M. J. Landrum, B. Aken, S. Ayling, R. Baertsch, J. Fernandez-Banet, J. L. Cherry, V. Curwen, M. Dicuccio, M. Kellis, J. Lee, M. F. Lin, M. Schuster, A. Shkeda, C. Amid, G. Brown, O. Dukhanina, A. Frankish, J. Hart, B. L. Maidak, J. Mudge, M. R. Murphy, T. Murphy, J. Rajan, B. Rajput, L. D. Riddick, C. Snow, C. Steward, D. Webb, J. A. Weber, L. Wilming, W. Wu, E. Birney, D. Haussler, T. Hubbard, J. Ostell, R. Durbin, and D. Lipman (2009): “The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes,” Genome Research, 19, 1316-1323.
[28] R Development Core Team (2011): R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria.
[29] Rabiner, L. R. (1989): “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, 77, 257-286.
[30] Robinson, M. D., D. J. McCarthy, and G. K. Smyth (2010): “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data,” Bioinformatics, 26, 139-140.
[31] Sathirapongsasuti, J. F., H. Lee, B. A. Horst, G. Brunner, A. J. Cochran, S. Binder, J. Quackenbush, and S. F. Nelson (2011): “Exome Sequencing-Based Copy-Number Variation and Loss of Heterozygosity Detection: ExomeCNV,” Bioinformatics.
[33] Shen, J. J. and N. R. Zhang (2011): “Change-Point Model on Non-Homogeneous Poisson Processes with Application in Copy Number Profiling by Next-Generation DNA Sequencing,” Technical report, Division of Biostatistics, Stanford University.
[34] St Clair, D. (2009): “Copy number variation and schizophrenia,” Schizophrenia Bulletin, 35, 9-12.
[35] Venkatraman, E. S. and A. B. Olshen (2007): “A faster circular binary segmentation algorithm for the analysis of array CGH data,” Bioinformatics, 23, 657-663.
[36] Weese, D., A.-K. Emde, T. Rausch, A. Döring, and K. Reinert (2009): “RazerS - fast read mapping with sensitivity control,” Genome Research, 19, 1646-1654.
[37] Xie, C. and M. Tammi (2009): “CNV-seq, a new method to detect copy number variation using high-throughput sequencing,” BMC Bioinformatics, 10, 80+.
[38] Yoon, S., Z. Xuan, V. Makarov, K. Ye, and J. Sebat (2009): “Sensitive and accurate detection of copy number variants using read depth of coverage,” Genome Research, 19, 1586-1592.
[39] Zhang, J., L. Feuk, G. E. Duggan, R. Khaja, and S. W. Scherer (2006): “Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome,” Cytogenetic and Genome Research, 115, 205-214.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.