
Privacy-Preserving Record Linkage using Autoencoders


Christen, Victor; Häntschel, Tim; Christen, Peter; Rahm, Erhard
Privacy-Preserving Record Linkage using Autoencoders
International Journal of Data Science and Analytics (accepted)


Privacy-preserving record linkage (PPRL) is the process of identifying records that represent the same real-world entity across different data sources while guaranteeing the privacy of sensitive information about these entities. A popular PPRL method is to encode sensitive plain-text data into Bloom filters (BFs), bit vectors that enable the efficient calculation of the similarities between records that PPRL requires. However, BF encoding cannot completely prevent the re-identification of plain-text values, because sets of BFs can contain bit patterns that can be mapped to plain-text values using cryptanalysis attacks. Various hardening techniques have therefore been proposed that modify the bit patterns in BFs with the aim of preventing such attacks. However, it has been shown that even hardened BFs can still be vulnerable to attacks. To avoid any such attacks, we propose a novel encoding technique for PPRL based on autoencoders that transforms BFs into vectors of real numbers. To achieve a high comparison quality of the generated numerical vectors, we propose a method that guarantees the comparability of encodings generated by the different data owners. Experiments on real-world data sets show that our technique achieves high linkage quality and prevents known cryptanalysis attacks on BF encoding.
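To make the Bloom filter step in the abstract more concrete, the following is a minimal sketch of how a string might be encoded into a BF and how two BFs can be compared. It is an illustration only and not the configuration used in the paper: the q-gram length, the filter length of 1024 bits, the two seeded SHA-256 hash functions, the Dice coefficient as similarity measure, and all function names are assumptions chosen for this example. The autoencoder step proposed in the paper is not shown.

```python
import hashlib


def qgrams(value, q=2):
    """Return the set of overlapping character q-grams of a string."""
    value = value.lower()
    return {value[i:i + q] for i in range(len(value) - q + 1)}


def bloom_encode(value, num_bits=1024, num_hashes=2, q=2):
    """Encode a string as the set of bit positions set to 1 in a Bloom filter.

    Assumption: each q-gram is hashed num_hashes times with SHA-256,
    using the hash index as a seed prefix. Real PPRL deployments
    typically use keyed hash functions with a secret shared by the
    data owners.
    """
    positions = set()
    for gram in qgrams(value, q):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).hexdigest()
            positions.add(int(digest, 16) % num_bits)
    return positions


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two Bloom filters given as sets of 1-bit positions."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


# Two spellings of the same surname share most q-grams, so their
# Bloom filters share most 1-bits; an unrelated name shares few or none.
sim_close = dice_similarity(bloom_encode("christen"), bloom_encode("christan"))
sim_far = dice_similarity(bloom_encode("christen"), bloom_encode("rahm"))
print(sim_close, sim_far)
```

Because identical q-grams always map to the same bit positions, frequent q-grams produce frequent bit patterns across a set of BFs. This regularity is what the cryptanalysis attacks mentioned above exploit.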