
Detecting duplicates in a homicide registry using a Bayesian partitioning approach. (English) Zbl 1454.62506

Summary: Finding duplicates in homicide registries is an important step in keeping an accurate account of lethal violence. This task is not trivial when unique identifiers of the individuals are not available, and it is especially challenging when records are subject to errors and missing values. Traditional approaches to duplicate detection output independent decisions on the coreference status of each pair of records, which often leads to nontransitive decisions that have to be reconciled in some ad-hoc fashion. The task of finding duplicate records in a data file can be alternatively posed as partitioning the data file into groups of coreferent records. We present an approach that targets this partition of the file as the parameter of interest, thereby ensuring transitive decisions. Our Bayesian implementation allows us to incorporate prior information on the reliability of the fields in the data file, which is especially useful when no training data are available, and it also provides a proper account of the uncertainty in the duplicate detection decisions. We present a study to detect killings that were reported multiple times to the United Nations Truth Commission for El Salvador.
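To illustrate the transitivity point in the summary, here is a minimal sketch. It is not the paper's method or its Bayesian model. It only shows that independent pairwise decisions can be nontransitive, whereas decisions derived from a partition of the file are transitive by construction. The record indices, the example link set, and the helper names `is_transitive` and `links_from_partition` are hypothetical.

```python
from itertools import combinations


def is_transitive(n, links):
    """Check whether a set of linked pairs over records 0..n-1 is transitive.

    A symmetric link relation is transitive exactly when no triple of
    records has two of its three pairs linked and the third not linked.
    """
    linked = {frozenset(pair) for pair in links}
    for i, j, k in combinations(range(n), 3):
        pairs = (frozenset((i, j)), frozenset((j, k)), frozenset((i, k)))
        if sum(p in linked for p in pairs) == 2:
            return False
    return True


def links_from_partition(labels):
    """Return all pairs of records that share a group label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


# Hypothetical independent pairwise decisions on three records:
# 0~1 and 1~2 are declared coreferent, but 0~2 is not.
pairwise_links = {(0, 1), (1, 2)}

# Hypothetical partition of four records into groups {0, 1, 2} and {3}.
partition_labels = [0, 0, 0, 1]
partition_links = links_from_partition(partition_labels)

print(is_transitive(3, pairwise_links))   # False
print(is_transitive(4, partition_links))  # True
```

Treating the group labels as the quantity to be estimated means any configuration under consideration already corresponds to a coherent set of duplicate decisions, so no after-the-fact reconciliation is needed.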

MSC:

62P25 Applications of statistics to social sciences

Software:

RecordLinkage; R; igraph
