A Clustering-Based Framework to Control Block Sizes for Entity Resolution

Fisher, J. ; Christen, P. ; Wang, Q. ; Rahm, E.

Proc. 21st ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD), 279-288, 2015

2015 / 08

Further information: http://dx.doi.org/10.1145/2783258.2783396

Abstract

Entity resolution (ER) is a common data cleaning task that involves determining which records from one or more data sets refer to the same real-world entities. Because a pairwise comparison of all records scales quadratically with the number of records in the data sets to be matched, it is common to use blocking or indexing techniques to reduce the number of comparisons required. These techniques split the data sets into blocks, and only records within the same block are compared with each other. Most existing blocking techniques do not provide control over the size of the generated blocks, despite its importance in many practical applications of ER, such as privacy-preserving record linkage and real-time ER. We propose two novel hierarchical clustering approaches which can generate blocks within a specified size range, and we present a penalty function which allows control of the trade-off between block quality and block size in the clustering process. We evaluate our techniques on three real-world data sets and compare them against three baseline approaches: standard blocking, Soundex encoding, and sorted neighbourhood based indexing. Our results show that the proposed techniques perform well on the measures of pairs completeness and reduction ratio compared with the baseline approaches, while achieving the required block size restrictions.
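
To make the evaluation measures concrete, the following minimal sketch illustrates plain standard blocking (one of the baselines named above, not the clustering approaches proposed in the paper) on a single data set, and computes pairs completeness and reduction ratio using their usual definitions from the ER literature. The toy records, the choice of surname as blocking key, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records, key_func):
    """Standard blocking: group record ids that share the same blocking key."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_func(rec)].append(rec_id)
    return dict(blocks)


def candidate_pairs(blocks):
    """Unordered record pairs that fall into the same block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(frozenset(p) for p in combinations(ids, 2))
    return pairs


def pairs_completeness(candidates, true_matches):
    """Fraction of true matching pairs that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)


def reduction_ratio(candidates, n_records):
    """Fraction of the full pairwise comparison space that blocking avoids."""
    total_pairs = n_records * (n_records - 1) // 2
    return 1 - len(candidates) / total_pairs


# Hypothetical toy data: records 1-3 are one entity, records 4-5 another.
records = {
    1: {"surname": "smith", "first": "john"},
    2: {"surname": "smith", "first": "jon"},
    3: {"surname": "smyth", "first": "john"},
    4: {"surname": "miller", "first": "anna"},
    5: {"surname": "miller", "first": "ana"},
    6: {"surname": "meyer", "first": "paul"},
}
true_matches = {
    frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3}), frozenset({4, 5}),
}

blocks = build_blocks(records, lambda rec: rec["surname"])
candidates = candidate_pairs(blocks)

print("largest block:", max(len(ids) for ids in blocks.values()))
print("pairs completeness:", pairs_completeness(candidates, true_matches))
print("reduction ratio:", reduction_ratio(candidates, len(records)))
```

In this toy case the exact-surname key keeps blocks small but separates "smyth" from "smith", so two of the four true matching pairs are never compared. An exact key also gives no direct handle on how large or small a block can become, which is the gap the size-controlled approaches described above are meant to address.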
