Genomic feature selection by coverage design optimization. (English) Zbl 1516.62565

Summary: We introduce a novel data reduction technique that selects a subset of tiles to ‘cover’ as many events of interest as possible in large-scale biological datasets (e.g. genetic mutations) while minimizing the number of tiles. A tile is a genomic unit capturing one or more biological events, such as a sequence of base pairs that can be sequenced and observed simultaneously. The goal is to reduce the number of tiles considered substantially, retaining those with areas of dense events in a cohort, thus saving on cost and enhancing interpretability. The reduction should not, however, discard so much information that sensible statistical analysis is no longer possible afterwards. We envisage application of our methods to a variety of high-throughput data types, particularly those produced by next-generation sequencing (NGS) experiments. The procedure is cast as a convex optimization problem, which we present along with methods for solving it. The method is demonstrated on a large dataset of somatic mutations spanning 5000+ patients, each having one of 29 cancer types. Applied to these data, our method dramatically reduces the number of gene locations required for broad coverage of patients and their mutations, giving subject specialists a more easily interpretable snapshot of recurrent mutational profiles in these cancers. The locations identified coincide with previously identified cancer genes. Finally, despite considerable data reduction, we show that our covering designs preserve the cancer discrimination ability of multinomial logistic regression models trained on all of the locations \((>1M)\).
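
The covering idea in the summary can be illustrated with a small sketch. The paper casts tile selection as a convex optimization problem; the Python below does not implement that formulation. It swaps in the classic greedy maximum-coverage heuristic, purely to show what covering many events with few tiles means. The function name `greedy_tile_cover`, the `target_fraction` parameter, and the toy tile and event labels are all hypothetical and do not come from the paper.

```python
from typing import Dict, Hashable, List, Set


def greedy_tile_cover(
    tiles: Dict[str, Set[Hashable]],
    target_fraction: float = 0.9,
) -> List[str]:
    """Greedy maximum-coverage heuristic (an illustrative stand-in,
    not the convex optimization method of the paper).

    `tiles` maps a tile name to the set of events it captures.
    Tiles are added until `target_fraction` of all events is covered.
    """
    all_events: Set[Hashable] = set().union(*tiles.values())
    needed = target_fraction * len(all_events)
    covered: Set[Hashable] = set()
    chosen: List[str] = []
    while len(covered) < needed:
        # Pick the tile that adds the most uncovered events;
        # iterating in sorted order breaks ties by tile name.
        best = max(sorted(tiles), key=lambda t: len(tiles[t] - covered))
        gain = tiles[best] - covered
        if not gain:
            break  # no remaining tile adds anything new
        chosen.append(best)
        covered |= gain
    return chosen


# Toy data (hypothetical): each event is a (patient, mutation) pair.
demo = {
    "tile_A": {("p1", "m1"), ("p2", "m2"), ("p3", "m3")},
    "tile_B": {("p3", "m3"), ("p4", "m4")},
    "tile_C": {("p4", "m4")},
    "tile_D": {("p5", "m5")},
}
print(greedy_tile_cover(demo, target_fraction=0.8))  # ['tile_A', 'tile_B']
```

Here two of the four tiles already cover four of the five events, which mirrors the trade-off described in the summary: a small loss in coverage buys a large reduction in the number of tiles.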

MSC:

62-XX Statistics

Software:

FilMINT; Bonmin; Dendrix
