Reconstructing DNA copy number by penalized estimation and imputation. (English) Zbl 1220.62146

Summary: Recent advances in genomics have underscored the surprising ubiquity of DNA copy number variation (CNV). Fortunately, modern genotyping platforms also detect CNVs with fairly high reliability. Hidden Markov models and algorithms have played a dominant role in the interpretation of CNV data. Here we explore CNV reconstruction via estimation with a fused-lasso penalty as suggested by R. Tibshirani and P. Wang [Biostatistics 9, No. 1, 18–29 (2008; Zbl 1274.62886)]. We mount a fresh attack on this difficult optimization problem by the following: (a) changing the penalty terms slightly by substituting a smooth approximation to the absolute value function, (b) designing and implementing a new MM (majorization-minimization) algorithm, and (c) applying a fast version of Newton’s method to jointly update all model parameters. Together these changes enable us to minimize the fused-lasso criterion in a highly effective way.
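The three ingredients (a)-(c) can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes the simplest signal-approximation form of the criterion, ½ Σ (y_i − β_i)² + λ1 Σ |β_i| + λ2 Σ |β_{i+1} − β_i|, and replaces |x| by the smooth surrogate sqrt(x² + ε). Each MM step majorizes every sqrt term by a quadratic, so the surrogate is minimized by one tridiagonal linear solve, which is what a single Newton step on a quadratic amounts to. The function names, the choice of ε, and the stopping rule are hypothetical.

```python
import math


def smooth_abs(x, eps):
    """Smooth approximation to |x| (assumed form: sqrt(x^2 + eps))."""
    return math.sqrt(x * x + eps)


def objective(y, beta, lam1, lam2, eps):
    """Smoothed fused-lasso criterion for a one-dimensional signal."""
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    sparsity = sum(smooth_abs(b, eps) for b in beta)
    fusion = sum(smooth_abs(beta[i + 1] - beta[i], eps)
                 for i in range(len(beta) - 1))
    return fit + lam1 * sparsity + lam2 * fusion


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm. lower[i] multiplies x[i-1], upper[i] multiplies x[i+1]."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def mm_fused_lasso(y, lam1, lam2, eps=1e-6, max_iter=200, tol=1e-10):
    """Minimize the smoothed criterion by repeated quadratic majorization."""
    n = len(y)
    if n == 0:
        return []
    beta = list(y)  # start from the raw data
    prev = objective(y, beta, lam1, lam2, eps)
    for _ in range(max_iter):
        # Majorization weights evaluated at the current iterate.
        a = [lam1 / smooth_abs(b, eps) for b in beta]
        c = [lam2 / smooth_abs(beta[i + 1] - beta[i], eps)
             for i in range(n - 1)]
        # Stationarity of the quadratic surrogate gives a tridiagonal system.
        diag = [1.0 + a[i] for i in range(n)]
        lower = [0.0] * n
        upper = [0.0] * n
        for i in range(n - 1):
            diag[i] += c[i]
            diag[i + 1] += c[i]
            upper[i] = -c[i]
            lower[i + 1] = -c[i]
        beta = solve_tridiagonal(lower, diag, upper, y)
        cur = objective(y, beta, lam1, lam2, eps)
        if prev - cur < tol:  # MM iterations never increase the objective
            break
        prev = cur
    return beta
```

Because the system matrix is symmetric and strictly diagonally dominant, the tridiagonal solve is stable and costs time linear in the number of SNPs per iteration.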
We also reframe the reconstruction problem in terms of imputation via discrete optimization. This approach is easier and more accurate than parameter estimation because it relies on the fact that only a handful of possible copy number states exist at each SNP. The dynamic programming framework has the added bonus of exploiting information that the current fused-lasso approach ignores. The accuracy of our imputations is comparable to that of hidden Markov models at a substantially lower computational cost.
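The imputation idea can be sketched as a Viterbi-style dynamic program over a small set of copy number states. This is an illustration under stated assumptions, not the authors' method: each state is assumed to have a known expected intensity (`state_means`, hypothetical values supplied by the caller), the per-SNP loss is squared error, and every change of state between adjacent SNPs incurs a fixed `switch_penalty`. The function name and all parameters are hypothetical.

```python
def impute_copy_numbers(y, state_means, switch_penalty):
    """Return the state path minimizing squared-error loss plus a
    fixed penalty for every change of state between adjacent SNPs.

    y             : list of observed intensities, one per SNP
    state_means   : dict mapping copy number -> assumed mean intensity
    switch_penalty: assumed cost of changing state between neighbours
    """
    if not y:
        return []
    states = sorted(state_means)

    def loss(obs, s):
        return 0.5 * (obs - state_means[s]) ** 2

    # cost[s] = best total cost of any path ending in state s.
    cost = {s: loss(y[0], s) for s in states}
    back = []  # back[i][s] = best predecessor of state s at SNP i + 1
    for obs in y[1:]:
        new_cost, pointers = {}, {}
        for s in states:
            prev = min(
                states,
                key=lambda r: cost[r] + (0.0 if r == s else switch_penalty),
            )
            jump = 0.0 if prev == s else switch_penalty
            pointers[s] = prev
            new_cost[s] = cost[prev] + jump + loss(obs, s)
        back.append(pointers)
        cost = new_cost

    # Trace the optimal path backwards from the cheapest final state.
    path = [min(states, key=cost.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With n SNPs and k states the recursion costs O(n k²) operations, which stays cheap because k is a handful of copy number states.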


62P10 Applications of statistics to biology and medical sciences; meta analysis
92C40 Biochemistry, molecular biology
92C37 Cell biology
90C90 Applications of mathematical programming
92D10 Genetics and epigenetics


Zbl 1274.62886


PennCNV; VanillaICE


[1] Bioucas-Dias, J. M., Figueiredo, M. A. T. and Oliveira, J. P. (2006). Adaptive total variation image deconvolution: A majorization-minimization approach. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’06). Toulouse, France.
[2] Candès, E. J. and Plan, Y. (2009). Near-ideal model selection by \(\ell_1\) minimization. Ann. Statist. 37 2145-2177. · Zbl 1173.62053
[3] Chan, T. F. and Shen, J. (2002). Mathematical models for local nontexture inpainting. SIAM J. Appl. Math. 62 1019-1043. · Zbl 1050.68157
[4] Colella, S., Yau, C., Taylor, J. M., Mirza, G., Butler, H., Clouston, P., Bassett, A. S., Seller, A., Holmes, C. C. and Ragoussis, J. (2007). QuantiSNP: An objective Bayes hidden-Markov model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Research 35 2013-2025.
[5] Conte, S. D. and de Boor, C. (1972). Elementary Numerical Analysis. McGraw-Hill, New York.
[6] Diskin, S. J., Li, M., Hou, C., Yang, S., Glessner, J., Hakonarson, H., Bucan, M., Maris, J. M. and Wang, K. (2008). Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Research 36 e126.
[7] Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika 81 425-455. · Zbl 0815.62019
[8] Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Statist. 1 302-332. · Zbl 1378.90064
[9] Iafrate, A. J., Feuk, L., Rivera, M. N., Listewnik, M. L., Donahoe, P. K., Qi, Y., Scherer, S. and Lee, C. (2004). Detection of large-scale variation in the human genome. Nature Genetics 36 949-951.
[10] Jakobsson, M., Scholz, S. W., Scheet, P., Gibbs, J. R., VanLiere, J. M., Fung, H. C., Szpiech, Z. A., Degnan, J. H., Wang, K., Guerreiro, R., Bras, J. M., Schymick, J. C., Hernandez, D. G., Traynor, B. J., Simon-Sanchez, J., Matarin, M., Britton, A., van de Leemput, J., Rafferty, I., Bucan, M., Cann, H. M., Hardy, J. A., Rosenberg, N. A. and Singleton, A. B. (2008). Genotype, haplotype and copy-number variation in worldwide human populations. Nature 451 998-1003.
[11] Kim, S.-J., Koh, K., Boyd, S. and Gorinevsky, D. (2009). \(\ell_1\) trend filtering. SIAM Review 51 339-360. · Zbl 1171.37033
[12] Korn, J. M., Kuruvilla, F. G., McCarroll, S. A., Wysoker, A., Nemesh, J., Cawley, S., Hubbell, E., Veitch, J., Collins, P. J., Darvishi, K., Lee, C., Nizzari, M. M., Gabriel, S. B., Purcell, S., Daly, M. J. and Altshuler, D. D. (2008). Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nature Genetics 40 1253-1260.
[13] Lange, K. (2004). Optimization. Springer, New York. · Zbl 1140.90004
[14] Li, Y. and Zhu, J. (2007). Analysis of array CGH data for cancer studies using fused quantile regression. Bioinformatics 23 2470-2476. · Zbl 1022.68519
[15] Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2009). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In The Neural Information Processing Systems Conference (NIPS’09). Vancouver, Canada.
[16] Redon, R., Ishikawa, S., Fitch, K. R., Feuk, L., Perry, G. H., Andrews, T. D., Fiegler, H., Shapero, M. H., Carson, A. R., Chen, W., Cho, E. K., Dallaire, S., Freeman, J. L., Gonzalez, J. R., Gratacos, M., Huang, J., Kalaitzopoulos, D., Komura, D., MacDonald, J. R., Marshall, C. R., Mei, R., Montgomery, L., Nishimura, K., Okamura, K., Shen, F., Somerville, M. J., Tchinda, J., Valsesia, A., Woodwark, C., Yang, F., Zhang, J., Zerjal, T., Zhang, J., Armengol, L., Conrad, D. F., Estivill, X., Tyler-Smith, C., Carter, N. P., Aburatani, H., Lee, C., Jones, K. W., Scherer, S. W. and Hurles, M. E. (2006). Global variation in copy number in the human genome. Nature 444 444-454.
[17] Rudin, L. I., Osher, S. and Fatemi, E. (1992). Nonlinear total variation based noise removal algorithms. Physica D 60 259-268. · Zbl 0780.49028
[18] Scharpf, R. B., Parmigiani, G., Pevsner, J. and Ruczinski, I. (2008). Hidden Markov models for the assessment of chromosomal alterations using high throughput SNP arrays. Ann. Appl. Statist. 2 687-713. · Zbl 1400.62285
[19] Sebat, J., Lakshmi, B., Troge, J., Alexander, J., Young, J., Lundin, P., Maner, S., Massa, H., Walker, M., Chi, M., Navin, N., Lucito, R., Healy, J., Hicks, J., Ye, K., Reiner, A., Gilliam, T. C., Trask, B., Patterson, N., Zetterberg, A. and Wigler, M. (2004). Large-scale copy number polymorphism in the human genome. Science 305 525-528.
[20] Stefansson, H., Rujescu, D., Cichon, S., Pietiläinen, O. P. H., Ingason, A., Steinberg, S., Fossdal, R., Sigurdsson, E., Sigmundsson, T., Buizer-Voskamp, J. E., Hansen, T., Jakobsen, K. D., Muglia, P., Francks, C., Matthews, P. M., Gylfason, A., Halldorsson, B. V., Gudbjartsson, D., Thorgeirsson, T. E., Sigurdsson, A., Jonasdottir, A., Jonasdottir, A., Bjornsson, A., Mattiasdottir, S., Blondal, T., Haraldsson, M., Magnusdottir, B. B., Giegling, I., Möller, H.-J., Hartmann, A., Shianna, K. V., Ge, D., Need, A. C., Crombie, C., Fraser, G., Walker, N., Lonnqvist, J., Suvisaari, J., Tuulio-Henriksson, A., Paunio, T., Toulopoulou, T., Bramon, E., Di Forti, M., Murray, R., Ruggeri, M., Vassos, E., Tosato, S., Walshe, M., Li, T., Vasilescu, C., Mühleisen, T. W., Wang, A. G., Ullum, H., Djurovic, S., Melle, I., Olesen, J., Kiemeney, L. A., Franke, B., Genetic Risk and Outcome in Psychosis (GROUP), Sabatti, C., Freimer, N. B., Gulcher, J. R., Thorsteinsdottir, U., Kong, A., Andreassen, O. A., Ophoff, R. A., Georgi, A., Rietschel, M., Werge, T., Petursson, H., Goldstein, D. B., Nöthen, M. M., Peltonen, L., Collier, D. A., St Clair, D. and Stefansson, K. (2008). Large recurrent microdeletions associated with schizophrenia. Nature 455 232-236.
[21] Tibshirani, R. and Wang, P. (2008). Spatial smoothing and hot spot detection for CGH data using the Fused Lasso. Biostatistics 9 18-29. · Zbl 1274.62886
[22] Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. Roy. Statist. Soc. Ser. B 67 91-108. · Zbl 1060.62049
[23] Vrijenhoek, T., Buizer-Voskamp, J. E., van der Stelt, I., Strengman, E., Genetic Risk and Outcome in Psychosis (GROUP) Consortium, Sabatti, C., van Kessel, A. G., Brunner, H. G., Ophoff, R. A. and Veltman, J. A. (2008). Recurrent CNVs disrupt three candidate genes in schizophrenia patients. The American Journal of Human Genetics 83 504-510.
[24] Wang, K., Li, M., Hadley, D., Liu, R., Glessner, J., Grant, S. F. A., Hakonarson, H. and Bucan, M. (2007). PennCNV: An integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research 17 1665-1674.
[25] Wang, H., Veldink, J. H., Blauw, H., van den Berg, L. H., Ophoff, R. A. and Sabatti, C. (2009). Markov models for inferring copy number variations from genotype data on Illumina platforms. Human Heredity 68 1-22.
[26] Wu, T. T. and Lange, K. (2008). Coordinate descent algorithm for lasso penalized regression. Ann. Appl. Statist. 2 224-244. · Zbl 1137.62045