Semiparametric maximum likelihood estimation with two-phase stratified case-control sampling. (English) Zbl 07769941

Summary: We develop statistical inference methods for fitting logistic regression models to data arising from the two-phase stratified case-control sampling design, where a subset of covariates is available only for a portion of the cases and controls, who are selected based on their case-control status and the fully collected covariates. In addition, we characterize the distribution of the incomplete covariates, conditional on the fully observed ones. Here, we include all subjects in the analysis in order to achieve consistency in the parameter estimation and optimal statistical efficiency. We develop a semiparametric maximum likelihood approach under the rare disease assumption, where the parameter estimates are obtained using a novel reparametrized profile likelihood technique. We study the large-sample distribution theory for the proposed estimator, and use simulations to demonstrate that it performs well in finite samples and improves on the statistical efficiency of existing approaches. We apply the proposed method to analyze a stratified case-control study of breast cancer nested within the Breast Cancer Detection and Demonstration Project, where one breast cancer risk predictor, namely, percent mammographic density, was measured only for a subset of the women in the study.
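To make the data structure of the two-phase stratified case-control design concrete, the following is a minimal simulation sketch in Python. It is not the authors' estimator and does not implement the semiparametric maximum likelihood method. It only illustrates how a covariate ends up observed for a subset of subjects selected by case-control status and a fully observed covariate. The function name `simulate_two_phase`, the covariate distributions, and the selection probabilities are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_two_phase(n_cases=500, n_controls=500, seed=0):
    """Simulate a toy two-phase stratified case-control data set.

    All distributions and selection probabilities below are hypothetical
    and chosen only for illustration; none come from the paper.
    """
    rng = np.random.default_rng(seed)

    # Phase 1: case-control status y and a fully observed binary covariate z.
    y = np.r_[np.ones(n_cases, dtype=int), np.zeros(n_controls, dtype=int)]
    z = rng.binomial(1, np.where(y == 1, 0.4, 0.2))

    # Covariate x that is only measured in phase 2 (assumed to depend on y and z).
    x = rng.normal(loc=0.5 * z + 0.3 * y, scale=1.0)

    # Phase 2: selection probability depends only on the (y, z) stratum.
    p_select = {(1, 1): 0.9, (1, 0): 0.5, (0, 1): 0.6, (0, 0): 0.2}
    probs = np.array([p_select[(int(yi), int(zi))] for yi, zi in zip(y, z)])
    selected = rng.random(y.size) < probs

    # x is missing (NaN) for subjects not selected into phase 2.
    x_obs = np.where(selected, x, np.nan)
    return y, z, x_obs, selected


y, z, x_obs, selected = simulate_two_phase()
print("subjects in phase 1:", y.size)
print("subjects with x measured:", int(selected.sum()))
```

In this sketch every subject contributes `y` and `z`, while `x_obs` is available only where `selected` is true. An analysis restricted to the selected subjects would discard the phase 1 information that the summary says the proposed method retains.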

MSC:

62-XX Statistics
