Futher information: https://nbn-resolving.org/urn:nbn:de:bsz:15-qucosa2-912319
Abstract
The digital revolution has led to rapid growth in the amount of data that is collected by organizations. Analyzing large data collections through the use of data mining and predictive analytics allows organizations to make data-driven decisions to improve efficiency, effectiveness, or profitability. Much of the data collected by organizations, such as government agencies, healthcare providers, or insurance companies, is related to individuals. However, each organization typically maintains an independent database containing specific data for a concrete purpose. Therefore, different databases typically store different data (variables) about a certain group of individuals. Complex research questions often cannot be answered solely based on one database because either certain data (co-variables) are missing, or the sample size is too small. Comprehensive data analysis therefore often requires the integration (combination) of data from multiple autonomous and thus heterogeneous databases. For holistic medical research, for instance, medical care data could be combined with clinical trial data, data from disease registries, or social data to uncover hidden correlations and improve prevention, therapy, and care. A crucial part of data integration is to identify records from different databases that refer to the same real-world entity, such as a patient. This task is known as record linkage and relies on comparing available quasi-identifiers, such as names, addresses, and dates of birth of patients. However, exchanging such personal data between different organizations conflicts with privacy and is often not permitted, since personal data is protected by strict legal regulations. Privacy-preserving record linkage (PPRL) addresses this problem by providing techniques for linking records while preserving the privacy of represented entities (individuals). PPRL approaches have to ensure that no private or confidential information is revealed during the linkage that would allow to (re-)identify an individual. Therefore, quasi-identifying attribute values are encoded (masked) and the linkage is conducted on encoded records. PPRL techniques pose three key challenges that need to be addressed, namely (1) reaching high efficiency with scalability to large and potentially many databases; (2) achieving high linkage quality by avoiding false and missing matches; and (3) ensuring a high degree of privacy by providing secure encodings and linkage protocols. In this thesis, we present extensive research that addresses all three key challenges of PPRL by providing solutions to several problems and shortcomings identified in existing PPRL approaches. First, we focus on improving the scalability and overall performance of PPRL by investigating locality-sensitive hashing (LSH) as a private blocking method to reduce the number of record pair comparisons. Furthermore, we develop parallel PPRL approaches that build on the modern distributed processing framework Apache Flink. We show that our approaches achieve high efficiency and effectiveness, scaling up to linking tens of millions of records. Second, we address the challenge of achieving high linkage quality in privacy-constrained linkage scenarios. Most existing PPRL approaches rely on a simple threshold-based classification, and thus likely fail to achieve accurate linkage results when dealing with dirty data containing errors and inconsistencies. Therefore, we examine post-processing methods for removing match candidates that are unlikely to match. We show that post-processing raises the overall linkage quality by limiting the number of false matches. Another problem is that measures to evaluate the quality of record linkage approaches require ground truth data that specifies known matches and non-matches. In many record linkage applications, however, ground truth data is not available. Additionally, in privacy-preserving linkage scenarios, a manual classification (clerical review) is generally not possible since inspecting actual attribute values of classified record pairs can reveal the identity of an individual. Therefore, we propose unsupervised approaches for estimating the quality of linkage results. The estimates can be used in practice, in particular, to optimize linkage configurations, such as the classification threshold. Third, we focus on the privacy aspect and review hardening techniques that aim to improve the privacy (security) of encoding schemes based on Bloom filters. Bloom filter encodings are frequently used in both research and practical applications. However, several attacks have been proposed showing that Bloom filter encodings are susceptible to cryptanalysis. We comprehensively evaluate the proposed hardening techniques in terms of privacy and linkage quality to assess their practicability and their effectiveness in counteracting attacks. Finally, we present Primat, an open-source toolbox for the definition and execution of tailored PPRL workflows. Primat offers several components for the different linkage participants that provide state-of-the-art PPRL methods, including various encoding and hardening techniques, LSH-based blocking, and post-processing methods.