
Extending models via gradient boosting: an application to Mendelian models. (English) Zbl 1478.62329

Summary: Improving existing widely adopted prediction models is often a more efficient and robust route to progress than training new models from scratch. Existing models may: (a) incorporate complex mechanistic knowledge, (b) leverage proprietary information, and (c) have surmounted barriers to adoption. Compared to model training, model improvement and modification receive little attention. In this paper we propose a general approach to model improvement: we combine gradient boosting with any previously developed model to improve model performance while retaining important existing characteristics. As an example, we consider Mendelian models, which estimate the probability of carrying genetic mutations that confer susceptibility to disease by using family pedigrees and health histories of family members. Via simulations, we show that integrating gradient boosting with an existing Mendelian model can produce an improved model that outperforms both that model and a model built using gradient boosting alone. We illustrate the approach on genetic testing data from the USC-Stanford Cancer Genetics Hereditary Cancer Panel (HCP) study.
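
The general idea of combining gradient boosting with a previously developed model can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary outcome, a single numeric feature, logistic loss, and regression stumps as base learners, and it starts the boosting iterations from the log-odds of the existing model's predictions rather than from a constant. The names `existing_prob`, `true_prob`, and `boost_existing_model`, as well as the synthetic data, are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return math.log(p / (1.0 - p))


def log_loss(ys, ps):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, ps)) / len(ys)


def fit_stump(xs, residuals):
    """Least-squares regression stump (one split) on a single feature."""
    best = None
    for t in sorted(set(xs))[:-1]:  # drop the max so the right leaf is non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost_existing_model(xs, ys, existing_prob, n_rounds=30, lr=0.1):
    """Gradient boosting for logistic loss, initialized at the existing model."""
    # Start from the existing model's log-odds instead of a constant.
    scores = [logit(existing_prob(x)) for x in xs]
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the logistic loss with respect to the score.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        scores = [s + lr * stump(x) for s, x in zip(scores, xs)]

    def predict(x):
        return sigmoid(logit(existing_prob(x)) + lr * sum(h(x) for h in stumps))

    return predict


# Hypothetical setup: the existing model misses a jump in risk above x = 0.6.
def existing_prob(x):
    return sigmoid(2.0 * x - 1.0)


def true_prob(x):
    return sigmoid(2.0 * x - 1.0 + (1.5 if x > 0.6 else 0.0))


random.seed(0)
xs = [random.random() for _ in range(200)]
ys = [1 if random.random() < true_prob(x) else 0 for x in xs]

predict = boost_existing_model(xs, ys, existing_prob)
base_loss = log_loss(ys, [existing_prob(x) for x in xs])
boosted_loss = log_loss(ys, [predict(x) for x in xs])
print("existing model log loss:", base_loss)
print("boosted model log loss: ", boosted_loss)
```

With zero boosting rounds the sketch returns the existing model unchanged, which is one way to read "retaining important existing characteristics": the boosted terms act only as corrections to the existing predictions. The losses printed here are training losses; a real comparison would use held-out data.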

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62M20 Inference from stochastic processes and prediction
