Entity resolution with empirically motivated priors. (English) Zbl 1335.62023
Summary: Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a bipartite graph that links the observed records to unobserved latent entities. Bayesian approaches provide attractive benefits, naturally providing uncertainty quantification via posterior probabilities. We propose a novel record linkage approach based on empirical Bayesian principles. Specifically, the empirical Bayesian-type step consists of taking the empirical distribution function of the data as the prior for the latent entities. This approach improves on the earlier hierarchical Bayesian (HB) approach not only by avoiding the prior specification problem but also by allowing both categorical and string-valued variables. Our extension to string-valued variables also involves the proposal of a new probabilistic mechanism by which observed record values for string fields can deviate from the values of their associated latent entities. Categorical fields that deviate from their corresponding true value are simply drawn from the empirical distribution function. We apply our proposed methodology to a simulated data set of German names and an Italian household survey on income and wealth, showing our method performs favorably compared to several standard methods in the literature. We also consider the robustness of our methods to changes in the hyper-parameters.
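
Illustration (not part of the original record): the summary's treatment of categorical fields can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a record either copies its latent entity's value or, with a distortion probability `beta`, is redrawn from the empirical distribution of the observed data. The names `empirical_distribution`, `categorical_likelihood`, `alpha`, and `beta`, and the toy city values, are all hypothetical.

```python
from collections import Counter


def empirical_distribution(values):
    """Return the relative frequency of each distinct observed value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def categorical_likelihood(observed, latent, alpha, beta):
    """P(observed | latent) under an assumed two-component mixture.

    With probability 1 - beta the record copies the latent value;
    with probability beta the field is distorted and redrawn from
    the empirical distribution alpha (hypothetical parameter names).
    """
    copied = (1.0 - beta) if observed == latent else 0.0
    return copied + beta * alpha.get(observed, 0.0)


# Toy data: one categorical field observed across four records.
cities = ["Berlin", "Berlin", "Bonn", "Ulm"]
alpha = empirical_distribution(cities)

# Likelihood of observing each value when the latent entity's value is "Berlin".
p_match = categorical_likelihood("Berlin", "Berlin", alpha, beta=0.1)
p_mismatch = categorical_likelihood("Bonn", "Berlin", alpha, beta=0.1)
```

Because the distorted values are drawn from the data's own frequencies, no separate prior over field values has to be specified, which is the point the summary makes about avoiding the prior specification problem.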

MSC:
62C10 Bayesian problems; characterization of Bayes procedures
62H30 Classification and discrimination; cluster analysis (statistical aspects)
Software: BayesTree; BartPy
[1] Belin, T. R. and Rubin, D. B. (1995). “A method for calibrating false-match rates in record linkage.” Journal of the American Statistical Association, 90(430): 694-707. · Zbl 0925.62548
[2] Bhattacharya, I. and Getoor, L. (2006). “A Latent Dirichlet Model for Unsupervised Entity Resolution.” In: SDM, volume 5, 59. SIAM.
[3] Breiman, L. (2001). “Random forests.” Machine Learning, 45(1): 5-32. · Zbl 1007.68152
[4] Broderick, T., Boyd, N., Wibisono, A., Wilson, A. C., and Jordan, M. I. (2013). “Streaming variational Bayes.” In: Advances in Neural Information Processing Systems, 1727-1735.
[5] Broderick, T. and Steorts, R. (2014). “Variational Bayes for Merging Noisy Databases.” Advances in Variational Inference NIPS 2014 Workshop. arXiv:1410.4792
[6] Carlin, B. P. and Louis, T. A. (2000). Bayes and Empirical Bayes Methods for Data Analysis (2nd ed.). Chapman & Hall/CRC. · Zbl 1017.62005
[7] Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). “BART: Bayesian additive regression trees.” The Annals of Applied Statistics, 4(1): 266-298. · Zbl 1189.62066
[8] Christen, P. (2005). “Probabilistic Data Generation for Deduplication and Data Linkage.” In: Proceedings of the Sixth International Conference on Intelligent Data Engineering and Automated Learning (IDEAL’05), 109-116.
[9] Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer.
[10] Christen, P. and Pudjijono, A. (2009). “Accurate Synthetic Generation of Realistic Personal Information.” In: Theeramunkong, T., Kijsirikul, B., Cercone, N., and Ho, T.-B. (eds.), Advances in Knowledge Discovery and Data Mining, volume 5476 of Lecture Notes in Computer Science, 507-514. Springer, Berlin, Heidelberg.
[11] Christen, P. and Vatsalan, D. (2013). “Flexible and Extensible Generation and Corruption of Personal Data.” In: Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM 2013).
[12] Dai, A. M. and Storkey, A. J. (2011). “The grouped author-topic model for unsupervised entity resolution.” In: Artificial Neural Networks and Machine Learning - ICANN 2011, 241-249. Springer.
[13] Fellegi, I. and Sunter, A. (1969). “A Theory for Record Linkage.” Journal of the American Statistical Association, 64(328): 1183-1210. · Zbl 0186.53903
[14] Gutman, R., Afendulis, C., and Zaslavsky, A. (2013). “A Bayesian Procedure for File Linking to Analyze End-of-Life Medical Costs.” Journal of the American Statistical Association, 108(501): 34-47. · Zbl 1379.62069
[15] Han, H., Giles, L., Zha, H., Li, C., and Tsioutsiouliklis, K. (2004). “Two supervised learning approaches for name disambiguation in author citations.” In: Proceedings of the 2004 Joint ACM/IEEE Conference on Digital Libraries, 2004, 296-305. IEEE.
[16] Larsen, M. D. and Rubin, D. B. (2001). “Iterative automated record linkage using mixture models.” Journal of the American Statistical Association, 96(453): 32-41.
[17] Liseo, B. and Tancredi, A. (2013). “Some advances on Bayesian record linkage and inference for linked data.”
[18] Martins, B. (2011). “A Supervised Machine Learning Approach for Duplicate Detection for Gazetteer Records.” Lecture Notes in Computer Science, 6631: 34-51.
[19] Robbins, H. (1956). “An empirical Bayes approach to statistics.” In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, 157-163. · Zbl 0074.35302
[20] Sadinle, M. (2014). “Detecting Duplicates in a Homicide Registry Using a Bayesian Partitioning Approach.” arXiv:1407.8219 · Zbl 1454.62506
[21] Sadinle, M. and Fienberg, S. (2013). “A Generalized Fellegi-Sunter Framework for Multiple Record Linkage with Application to Homicide Record-Systems.” Journal of the American Statistical Association, 108(502): 385-397. · Zbl 06195947
[22] Steorts, R., Ventura, S., Sadinle, M., and Fienberg, S. (2014a). “A Comparison of Blocking Methods for Record Linkage.” In: Privacy in Statistical Databases, 253-268. Springer.
[23] Steorts, R. C., Hall, R., and Fienberg, S. (2014b). “SMERED: A Bayesian Approach to Graphical Record Linkage and De-duplication.” JMLR W&CP, 33: 922-930. arXiv:1403.0211
[24] Steorts, R. C., Hall, R., and Fienberg, S. (2015). “A Bayesian Approach to Graphical Record Linkage and De-duplication.” Minor Revision, Journal of the American Statistical Association. arXiv:1312.4645
[25] Tancredi, A. and Liseo, B. (2011). “A hierarchical Bayesian approach to record linkage and population size problems.” Annals of Applied Statistics, 5(2B): 1553-1585. · Zbl 1223.62015
[26] Torvik, V. I. and Smalheiser, N. R. (2009). “Author name disambiguation in MEDLINE.” ACM Transactions on Knowledge Discovery from Data (TKDD), 3(3): 11.
[27] Treeratpituk, P. and Giles, C. L. (2009). “Disambiguating authors in academic publications using random forests.” In: Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries, 39-48. ACM.
[28] Ventura, S. (2013). “Large-Scale Clustering Methods with Applications to Record Linkage.” PhD thesis proposal, CMU, Pittsburgh, PA.
[29] Wainwright, M. J. and Jordan, M. I. (2008). “Graphical models, exponential families, and variational inference.” Foundations and Trends in Machine Learning, 1(1-2): 1-305. · Zbl 1193.62107
[30] Wallach, H. M., Jensen, S., Dicker, L., and Heller, K. A. (2010). “An Alternative Prior Process for Nonparametric Bayesian Clustering.” In: International Conference on Artificial Intelligence and Statistics, 892-899.
[31] Winkler, W. E. (2006). “Overview of record linkage and current research directions.” In: Bureau of the Census. Citeseer.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. It attempts to reflect the references listed in the original paper as accurately as possible without claiming the completeness or perfect precision of the matching.