
Privacy Preserving Record Linkage (PPRL) for Big Data


Finding similar records in different datasets is a crucial step in data integration. This record linkage (entity resolution) task usually requires a complex workflow with steps for blocking (indexing), similarity computation, and classification. Besides accuracy and scalability, confidentiality is a further challenge. If sensitive person-related records, e.g. patient data, are to be matched, they must be encoded up front so that a third party cannot recover the original values. Privacy-preserving record linkage (PPRL) addresses the matching of records whose attribute values (name, date of birth, address) are tokenized, hashed, and converted to bit vectors. Records that refer to the same real-world entity can therefore not be linked with traditional techniques based on the similarity of the original attribute values. Instead, linkage has to rely solely on the similarity of the bit vectors. At the same time, because of the inherent quadratic complexity of pairwise similarity computation and the large amount of data to process, state-of-the-art approaches demand very high resources, which limits their applicability to large-scale problems.
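To make the encoding step concrete, the following is a minimal sketch of one common way to tokenize, hash, and compare attribute values: a Bloom-filter-style encoding of character bigrams, compared with the Dice coefficient. It is an illustration under assumptions, not necessarily the scheme used in this project. The function names (`bigrams`, `encode`, `dice`), the parameters `num_bits` and `num_hashes`, and the shared `secret` key are all hypothetical choices made for this example.

```python
import hashlib
import hmac


def bigrams(value):
    """Tokenize a string into character bigrams, padded at both ends."""
    padded = "_" + value.lower().strip() + "_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def encode(value, num_bits=1000, num_hashes=10, secret=b"shared-secret"):
    """Hash every bigram to num_hashes bit positions (assumed parameters).

    The bit vector is represented as a Python int; bit i is set if some
    (bigram, hash index) pair maps to position i.
    """
    bits = 0
    for token in bigrams(value):
        for k in range(num_hashes):
            message = f"{k}:{token}".encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            position = int.from_bytes(digest[:8], "big") % num_bits
            bits |= 1 << position
    return bits


def popcount(bits):
    """Number of set bits in the vector."""
    return bin(bits).count("1")


def dice(a, b):
    """Dice coefficient of two bit vectors: 2 * |a AND b| / (|a| + |b|)."""
    total = popcount(a) + popcount(b)
    if total == 0:
        return 0.0
    return 2.0 * popcount(a & b) / total


if __name__ == "__main__":
    smith = encode("smith")
    print(dice(smith, encode("smyth")))  # similar names share many bits
    print(dice(smith, encode("jones")))  # unrelated names share few bits
```

Both parties would have to agree on the same key and parameters beforehand, so that equal tokens set equal bit positions. The comparison then works on the bit vectors alone, but every pair of records still needs one such comparison, which is where the quadratic cost mentioned above comes from.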


We investigate how to extend record linkage techniques for privacy-sensitive applications where it is not possible to use the original attribute values. We plan to extend our existing Hadoop-based framework for entity matching with new techniques to support the automatic, scalable, and high-quality matching of encrypted data. Furthermore, we want to use massively parallel graphics processors (GPUs) to further speed up the time-consuming process of privacy-preserving entity resolution.

Previous Work

This project builds upon our previous work on the Hadoop-based entity matching prototype Dedoop.

Project Members

Related Publications


Vatsalan, Dinusha; Sehili, Ziad; Christen, Peter; Rahm, Erhard
Privacy-Preserving Record Linkage for Big Data: Current Approaches and Research Challenges
In: Handbook of Big Data Technologies (eds.: A. Zomaya, S. Sakr), Springer 2017, to appear

Vatsalan, D.; Christen, P.; Rahm, E.
Scalable privacy-preserving linking of multiple databases using Counting Bloom filters
Proc. ICDM Workshop on Privacy and Discrimination in Data Mining (PDDM)

Sehili, Ziad; Rahm, Erhard
Speeding up Privacy Preserving Record Linkage for Metric Space Similarity Measures
Datenbankspektrum 16, pp. 227-236

Sehili, Z.; Kolb, L.; Borgs, C.; Schnell, R.; Rahm, E.
Privacy Preserving Record Linkage with PPJoin
Proc. of 16. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web (BTW), 2015