Bayesian double feature allocation for phenotyping with electronic health records. (English) Zbl 1452.62835

Summary: Electronic health records (EHR) provide opportunities for deeper understanding of human phenotypes – in our case, latent disease – based on statistical modeling. We propose a categorical matrix factorization method to infer latent diseases from EHR data. A latent disease is defined as an unknown biological aberration that causes a set of common symptoms for a group of patients. The proposed approach is based on a novel double feature allocation model which simultaneously allocates features to the rows and the columns of a categorical matrix. Using a Bayesian approach, available prior information on known diseases (e.g., hypertension and diabetes) greatly improves identifiability and interpretability of the latent diseases. We assess the proposed approach by simulation studies including mis-specified models and comparison with sparse latent factor models. In the application to a Chinese EHR dataset, we identify 10 latent diseases, each of which is shared by groups of subjects with specific health traits related to lipid disorder, thrombocytopenia, polycythemia, anemia, bacterial and viral infections, allergy, and malnutrition. The identification of the latent diseases can help healthcare officials better monitor the subjects’ ongoing health conditions and look into potential risk factors and approaches for disease prevention. We cross-check the reported latent diseases with medical literature and find agreement between our discovery and reported findings elsewhere. We provide an R package “dfa” implementing our method and an R shiny web application reporting the findings.
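The idea of allocating latent features to both rows and columns of a categorical matrix can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' model or the API of the "dfa" package: the function name, the three ordered categories {-1, 0, 1}, and the rule that combines disease effects (sign of the summed effects) are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_double_feature_allocation(n_patients=6, n_symptoms=5,
                                       n_diseases=2, seed=0):
    """Toy double feature allocation (hypothetical, for illustration only).

    Returns
    -------
    A : (n_patients, n_diseases) binary matrix; A[i, k] = 1 means
        patient i is allocated latent disease k (row-side allocation).
    B : (n_symptoms, n_diseases) matrix with entries in {-1, 0, 1};
        B[j, k] says whether disease k lowers, does not affect, or
        raises symptom j (column-side allocation).
    Y : (n_patients, n_symptoms) categorical matrix in {-1, 0, 1}.
    """
    rng = np.random.default_rng(seed)

    # Row allocation: which patients carry which latent diseases.
    A = rng.integers(0, 2, size=(n_patients, n_diseases))

    # Column allocation: which symptoms each latent disease affects,
    # and in which direction.
    B = rng.integers(-1, 2, size=(n_symptoms, n_diseases))

    # Assumed combination rule: sum the effects of all diseases a
    # patient carries and keep only the sign, giving a categorical
    # "low / normal / high" reading for each patient-symptom pair.
    Y = np.sign(A @ B.T)
    return A, B, Y


if __name__ == "__main__":
    A, B, Y = simulate_double_feature_allocation()
    print("patient-disease allocation A:\n", A)
    print("symptom-disease allocation B:\n", B)
    print("categorical observations Y:\n", Y)
```

In this toy setting a patient with no latent disease has an all-zero row in Y, and inference would run in the opposite direction: given Y, recover A, B, and the number of latent diseases.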


MSC:
62P10 Applications of statistics to biology and medical sciences; meta analysis
62F15 Bayesian inference
62H30 Classification and discrimination; cluster analysis (statistical aspects)


Software: R; BNPdensity; shiny; BDgraph; dfa

