Correcting classifiers for sample selection bias in two-phase case-control studies.

*(English)* Zbl 1398.92012

Summary: Epidemiological studies often use stratified data in which rare outcomes or exposures are artificially enriched. This design increases precision in association tests but distorts predictions when classifiers trained on the stratified sample are applied to nonstratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear, especially for machine learning classifiers. Focusing on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods that imitate the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits only from the parametric inverse-probability bagging we propose. For other classifiers, correction is mostly advantageous, and the correction methods perform similarly. We discuss the consequences of inappropriate distribution assumptions and the reasons for the differing behavior of the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if the distribution assumptions are roughly fulfilled. We provide our implementation in the R package sambia.
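The inverse-probability resampling idea underlying the proposed bagging scheme can be sketched as follows. This is a minimal illustration, not the sambia implementation: the sampling probabilities `pi` are assumed known from the study design, the toy data and stratum probabilities (0.9 for cases, 0.1 for controls) are invented, and in a full bagging scheme a classifier would be fit on each bootstrap replicate and the predictions averaged.

```python
import numpy as np

def ip_bootstrap(X, y, pi, rng):
    """One inverse-probability bootstrap replicate: each unit is drawn
    with probability proportional to 1/pi, the inverse of its probability
    of having been selected into the stratified sample."""
    w = 1.0 / pi
    idx = rng.choice(len(y), size=len(y), replace=True, p=w / w.sum())
    return X[idx], y[idx]

# Toy two-phase sample: cases were selected with probability 0.9,
# controls with probability 0.1, so cases are heavily over-represented.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, size=(90, 2)),   # 90 sampled cases
               rng.normal(0.0, 1.0, size=(10, 2))])  # 10 sampled controls
y = np.r_[np.ones(90), np.zeros(10)]
pi = np.where(y == 1, 0.9, 0.1)

# In a bagging ensemble, one classifier would be trained per replicate;
# here we only check the resampling itself. The total inverse-probability
# weight is 90/0.9 = 100 for cases and 10/0.1 = 100 for controls, so
# replicates are balanced on average, mimicking the source population.
case_fracs = [ip_bootstrap(X, y, pi, rng)[1].mean() for _ in range(200)]
mean_case_frac = float(np.mean(case_fracs))
print(mean_case_frac)  # close to 0.5
```

The stochastic oversampling variant proposed in the paper instead enlarges the sample with weighted draws plus noise; the shared ingredient is weighting each unit by its inverse selection probability (Horvitz-Thompson weighting).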

##### MSC:

92B15 | General biostatistics |

62P10 | Applications of statistics to biology and medical sciences; meta analysis |

62D05 | Sampling theory, sample surveys |

92D30 | Epidemiology |


\textit{N. Krautenbacher} et al., Comput. Math. Methods Med. 2017, Article ID 7847531, 18 p. (2017; Zbl 1398.92012)

