(Privately) Estimating Linkage Quality for Record Linkage
27th International Conference on Extending Database Technology
Further information: http://dx.doi.org/10.48786/edbt.2024.26
Record linkage is the task of identifying records from different databases that refer to the same real-world entity. This task is an essential component of data integration and facilitates data analysis in a variety of domains, including healthcare, national security, and e-commerce. To evaluate the quality of record linkage approaches, the performance measures of precision, recall, and F-measure are commonly used. These measures require ground truth data that specifies known matches and non-matches. In practical linkage applications, however, such ground truth data is typically unavailable. Although linkage quality can be assessed manually by domain experts, such a clerical review process is time- and resource-consuming and generally not feasible when linking databases that are very large or that contain sensitive (personal) data. We review existing unsupervised approaches for estimating the quality of linkage results and propose improved ones. We evaluate our approaches on multiple datasets from three different domains. This evaluation shows that our approaches outperform existing methods and lead to estimates that are close to the actual linkage quality.
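To make the dependence on ground truth concrete, the following minimal sketch computes precision, recall, and F-measure for a set of predicted matches against a set of known true matches. It illustrates the standard definitions only and is not the estimation approach proposed in the paper; the function name `linkage_quality`, the pair representation, and the toy record identifiers are assumptions made for this example.

```python
from typing import Set, Tuple

# A record pair is (id in database A, id in database B); this
# representation is an assumption made for illustration.
Pair = Tuple[str, str]


def linkage_quality(predicted: Set[Pair], truth: Set[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) of predicted matches.

    `truth` is the set of known true matches, i.e. the ground truth
    that is usually unavailable in practical linkage applications.
    """
    tp = len(predicted & truth)   # true matches that were predicted
    fp = len(predicted - truth)   # predicted matches that are wrong
    fn = len(truth - predicted)   # true matches that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Toy data (hypothetical identifiers).
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b9")}
truth = {("a1", "b1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")}

p, r, f = linkage_quality(predicted, truth)
print(f"precision={p:.3f} recall={r:.3f} f_measure={f:.3f}")
```

Without the `truth` set, none of the three counts can be computed directly, which is why unsupervised estimates of linkage quality are needed.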