A Bayesian approach to disease clustering using restricted Chinese restaurant processes. (English) Zbl 1439.62231

Summary: Identifying disease clusters (areas with an unusually high incidence of a particular disease) is a common problem in epidemiology and public health. We describe a Bayesian nonparametric mixture model for disease clustering that constrains clusters to be made of adjacent areal units. This is achieved by modifying the exchangeable partition probability function associated with the Ewens sampling distribution. We call the resulting prior the Restricted Chinese Restaurant Process, as the associated full conditional distributions resemble those associated with the standard Chinese Restaurant Process. The model is illustrated using synthetic data sets and in an application to oral cancer mortality in Germany.
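
The idea of restricting the Ewens partition prior to spatially contiguous clusters can be illustrated with a minimal sketch. It is not taken from the paper: the function names, the adjacency-list representation `neighbors`, and the choice to give non-contiguous partitions zero prior mass (leaving the result unnormalized) are all assumptions made for illustration. Only the Ewens exchangeable partition probability function itself, alpha^k * prod_j (n_j - 1)! / (alpha (alpha + 1) ... (alpha + n - 1)), is standard.

```python
import math
from collections import defaultdict, deque


def ewens_log_eppf(sizes, alpha):
    """Log EPPF of the Ewens sampling distribution for the given cluster sizes."""
    n = sum(sizes)
    k = len(sizes)
    return (
        k * math.log(alpha)
        + math.lgamma(alpha) - math.lgamma(alpha + n)  # 1 / rising factorial of alpha
        + sum(math.lgamma(s) for s in sizes)           # log (n_j - 1)! for each cluster
    )


def is_connected(members, neighbors):
    """Breadth-first search that only walks through units in `members`."""
    members = set(members)
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if v in members and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == members


def restricted_log_weight(labels, neighbors, alpha):
    """Unnormalized log prior weight of a partition of areal units.

    Assumption (illustrative only): a partition keeps its Ewens weight when
    every cluster is a connected set of areal units, and gets weight zero
    (log weight -inf) otherwise.
    """
    clusters = defaultdict(list)
    for unit, label in enumerate(labels):
        clusters[label].append(unit)
    if not all(is_connected(m, neighbors) for m in clusters.values()):
        return float("-inf")
    return ewens_log_eppf([len(m) for m in clusters.values()], alpha)


# Hypothetical map: four areal units arranged in a line, 0 - 1 - 2 - 3.
line_map = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
contiguous = restricted_log_weight([0, 0, 1, 1], line_map, alpha=1.0)
scattered = restricted_log_weight([0, 1, 0, 1], line_map, alpha=1.0)
```

In this toy map the partition {0, 1}, {2, 3} keeps its Ewens weight, while {0, 2}, {1, 3} is excluded because neither cluster is connected. A sampler built on such a prior would also need the normalizing constant to cancel in its full conditionals, which this sketch does not address.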

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)
