Finding similar records in different datasets is a crucial step in data integration. This record linkage (entity resolution) task usually requires a complex workflow with steps for blocking (indexing), similarity computation, and classification. Besides accuracy and scalability, confidentiality is a new challenge to be addressed. If sensitive person-related records, e.g. patient data, are to be matched, they must be encoded up front so that a third party cannot decrypt them. Privacy-preserving record linkage (PPRL) addresses the matching of records whose attribute values (name, date of birth, address) are tokenized, hashed, and converted to bitsets. Thus, the linkage of records referring to the same real-world entity cannot make use of traditional techniques based on the similarity of the original attribute values. Instead, it must rely solely on the similarity of the bitsets. At the same time, due to the inherent quadratic complexity of pairwise similarity computation and the large amount of data to process, state-of-the-art approaches demand very high resources, which limits their applicability to large-scale problems.
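To make the encoding step concrete, the following is a minimal sketch of how attribute values could be tokenized, hashed, and mapped to a bitset, and how two such bitsets could then be compared. It is an illustration under stated assumptions and not the project's actual scheme: the text does not specify the tokenization, the hash function, or the similarity measure. Here we assume padded character bigrams, a keyed hash (HMAC-SHA256) with a Bloom-filter-style mapping to bit positions, and the Dice coefficient. The names bigrams, encode, and dice and the parameters num_bits, num_hashes, and secret are hypothetical.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a normalized attribute value into padded character bigrams."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(values, num_bits=256, num_hashes=4, secret=b"shared-secret"):
    """Encode all attribute values of one record into a single bitset.

    The bitset is held in a Python int. Each token sets up to num_hashes
    bit positions (assumed Bloom-filter-style encoding). The secret key is
    an assumption: it would be shared by the data owners only.
    """
    bits = 0
    for value in values:
        for token in bigrams(value):
            for seed in range(num_hashes):
                message = bytes([seed]) + token.encode("utf-8")
                digest = hmac.new(secret, message, hashlib.sha256).digest()
                position = int.from_bytes(digest[:8], "big") % num_bits
                bits |= 1 << position
    return bits


def dice(a, b):
    """Dice coefficient of two bitsets: 2 * |a AND b| / (|a| + |b|)."""
    common = bin(a & b).count("1")
    total = bin(a).count("1") + bin(b).count("1")
    return 2 * common / total if total else 0.0


if __name__ == "__main__":
    rec_a = encode(["Smith", "1980-01-01"])
    rec_b = encode(["Smyth", "1980-01-01"])   # typo variant of rec_a
    rec_c = encode(["Garcia", "1975-12-30"])  # unrelated record
    print("a vs b:", dice(rec_a, rec_b))
    print("a vs c:", dice(rec_a, rec_c))
```

In this sketch the linkage unit only sees the integers produced by encode, so similarity has to be computed on bits alone. Comparing every record of one source with every record of the other means calling dice once per pair, which is the quadratic cost mentioned above and the reason blocking and parallel execution matter.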
We investigate how to extend record linkage techniques for privacy-sensitive applications where it is not possible to use the original attribute values. We plan to extend our existing Hadoop-based framework for entity matching with new techniques to support the automatic, scalable, and high-quality matching of encrypted data. Furthermore, we want to exploit massively parallel graphics processors (GPUs) to further speed up the time-consuming process of privacy-preserving entity resolution.
This project builds upon our previous work on the Hadoop-based entity matching prototype