Clustering the prevalence of pediatric chronic conditions in the United States using distributed computing. (English) Zbl 1405.62224

Summary: This paper presents an approach to clustering the prevalence of chronic conditions among children with public insurance in the United States. The data consist of community-level prevalence estimates for 25 pediatric chronic conditions. We employ a spatial clustering algorithm to identify clusters of communities with similar chronic condition prevalences. The primary challenge is the computational effort needed to estimate the spatial clustering for all communities in the U.S. To address this challenge, we develop a distributed computing approach to spatial clustering. Overall, we find that the burden of chronic conditions tends to be similar across rural communities but differs widely across urban communities. This finding suggests that similar interventions could be used to manage chronic conditions in rural communities, whereas urban areas call for targeted interventions.
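The summary does not spell out the algorithm, so the following is only an illustrative sketch of the general idea of spreading a clustering computation over several workers. It is not the authors' method: it swaps in plain k-means (no spatial dependence) organized in a map-reduce pattern, where each worker summarizes one shard of communities and the summaries are combined to update the cluster centers. The function names `partial_stats` and `distributed_kmeans`, the two-condition prevalence vectors, and the starting centers are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(shard, centroids):
    """Map step: per-cluster coordinate sums and counts for one shard."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in shard:
        # Assign the point to its nearest centroid (squared Euclidean distance).
        nearest = min(
            range(k),
            key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]
    return sums, counts


def distributed_kmeans(shards, centroids, n_iter=10):
    """Lloyd iterations in which each shard is handled by its own worker."""
    centroids = [list(c) for c in centroids]
    k, dim = len(centroids), len(centroids[0])
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for _ in range(n_iter):
            results = list(pool.map(lambda s: partial_stats(s, centroids), shards))
            # Reduce step: combine the per-shard statistics into new centroids.
            for j in range(k):
                total = sum(counts[j] for _, counts in results)
                if total == 0:
                    continue  # leave the centroid of an empty cluster unchanged
                centroids[j] = [
                    sum(sums[j][d] for sums, _ in results) / total
                    for d in range(dim)
                ]
    return centroids


# Hypothetical toy data: each tuple is a community's prevalence for two conditions,
# and each inner list is the shard of communities held by one worker.
shards = [
    [(0.10, 0.20), (0.12, 0.22)],
    [(0.80, 0.90), (0.82, 0.88)],
    [(0.11, 0.19), (0.79, 0.91)],
]
centroids = distributed_kmeans(shards, [(0.0, 0.0), (1.0, 1.0)], n_iter=5)
```

Threads are used here only to keep the sketch self-contained; for CPU-bound work in Python, separate processes or machines would be needed to obtain a real speedup. Only the small per-shard summaries travel between workers, which is what makes this pattern attractive when the number of communities is large.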


62P10 Applications of statistics to biology and medical sciences; meta analysis
62H30 Classification and discrimination; cluster analysis (statistical aspects)


Software: spatial; GMRFLib; Julia

