Hidden Markov models for the assessment of chromosomal alterations using high-throughput SNP arrays. (English) Zbl 1400.62285

Summary: Chromosomal DNA varies between individuals at the level of entire chromosomes (e.g., aneuploidy, in which the chromosome copy number is altered), at the level of chromosomal segments (including insertions, deletions, inversions, and translocations), and at the level of small genomic regions (including single nucleotide polymorphisms). A variety of alterations in chromosomal DNA, many of which can be detected using high-density single nucleotide polymorphism (SNP) microarrays, are linked to normal variation as well as to disease and are therefore of particular interest. These include changes in copy number (deletions and duplications) and in genotype (e.g., the occurrence of regions of homozygosity). Hidden Markov models (HMMs) are particularly useful for detecting such alterations because they model the spatial dependence between neighboring SNPs. Here, we improve previous approaches that use HMM frameworks for inference in high-throughput SNP arrays by integrating copy number, genotype calls, and the corresponding measures of uncertainty when available. Using simulated and experimental data, we demonstrate in particular how confidence scores control smoothing in a probabilistic framework. Software for fitting HMMs to SNP array data is available in the R package VanillaICE.
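
The idea that per-SNP uncertainty controls how strongly an HMM smooths can be illustrated with a small sketch. The code below is not the VanillaICE implementation and does not use its API; it is a generic Viterbi decoder in Python under stated assumptions: three hypothetical states, Gaussian emissions on log2 copy-number ratios, a single made-up "stay" probability, and a per-SNP standard deviation standing in for the confidence score. All names and numeric values are illustrative only.

```python
import math

# Hypothetical hidden states and assumed mean log2 copy-number ratios
# (1, 2, and 3 copies relative to a diploid reference).
STATES = ["deletion", "normal", "amplification"]
MEANS = [math.log2(1 / 2), 0.0, math.log2(3 / 2)]


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution with the given mean and sd."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(log_ratios, sds, stay_prob=0.99):
    """Most likely state path given per-SNP estimates and their uncertainties.

    `sds` holds one standard deviation per SNP: a large value flattens that
    SNP's emission probabilities, so the transition model dominates there.
    """
    n_states = len(STATES)
    log_stay = math.log(stay_prob)
    log_move = math.log((1 - stay_prob) / (n_states - 1))

    # Initialization with a uniform distribution over states (an assumption).
    score = [
        math.log(1 / n_states) + log_normal_pdf(log_ratios[0], m, sds[0])
        for m in MEANS
    ]
    back = []  # back[t][j] = best predecessor of state j at step t + 1

    for x, sd in zip(log_ratios[1:], sds[1:]):
        new_score, pointers = [], []
        for j in range(n_states):
            candidates = [
                score[i] + (log_stay if i == j else log_move)
                for i in range(n_states)
            ]
            best = max(range(n_states), key=lambda i: candidates[i])
            pointers.append(best)
            new_score.append(candidates[best] + log_normal_pdf(x, MEANS[j], sd))
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [STATES[s] for s in path]


# One outlying SNP (log2 ratio of -1) in an otherwise normal region.
log_ratios = [0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0]

# Precisely measured outlier: the decoder calls a single-SNP deletion.
confident = viterbi(log_ratios, [0.1] * 7)

# Same outlier with a large standard deviation: it is smoothed over.
uncertain = viterbi(log_ratios, [0.1, 0.1, 0.1, 1.0, 0.1, 0.1, 0.1])
```

In this toy setting, the only difference between the two calls is the uncertainty attached to the fourth SNP; the observed values and the transition probabilities are identical. This mirrors, in a simplified way, the role the summary attributes to confidence scores.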


62P10 Applications of statistics to biology and medical sciences; meta analysis

