Data Partitioning for Parallel Entity Matching

Kirsten, T. ; Kolb, L. ; Hartung, M. ; Groß, A. ; Köpcke, H. ; Rahm, E.

Proc. 8th Intl. Workshop on Quality in Databases (QDB), 2010

2010 / 09

Other

Abstract

<p style="text-align:justify;">
Entity matching is an important and difficult step in integrating web data. To reduce the typically high execution time of matching, we investigate how entity matching can be performed in parallel on a distributed infrastructure. We propose different strategies to partition the input data and generate multiple match tasks that can be executed independently. One of our strategies supports both blocking, to reduce the search space for matching, and parallel matching, to improve efficiency. Special attention is given to the number and size of data partitions, as they affect the overall communication overhead and the memory requirements of individual match tasks. We have developed a service-based distributed infrastructure for the parallel execution of match workflows. We evaluate our approach in detail for different match strategies on real-world product data from different web shops. We also consider caching of input entities and affinity-based scheduling of match tasks.
</p>
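
The idea of partitioning the input and deriving independent match tasks can be illustrated with a minimal sketch. It assumes a simple size-based split into partitions of at most `max_size` entities and one match task per partition plus one per pair of partitions. The function names (`size_based_partitions`, `generate_match_tasks`, `candidate_pairs`) and the parameter `max_size` are hypothetical and are not taken from the paper. The sketch does not cover blocking, caching, or scheduling.

```python
from itertools import combinations
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def size_based_partitions(entities: Sequence[T], max_size: int) -> List[List[T]]:
    """Split the input into consecutive partitions of at most max_size entities.

    Assumption: partitions are cut purely by size, ignoring entity content.
    """
    if max_size < 1:
        raise ValueError("max_size must be positive")
    return [list(entities[i:i + max_size]) for i in range(0, len(entities), max_size)]


def generate_match_tasks(num_partitions: int) -> List[Tuple[int, int]]:
    """Return one task per partition (i, i) and one per unordered partition pair (i, j)."""
    intra = [(i, i) for i in range(num_partitions)]
    inter = list(combinations(range(num_partitions), 2))
    return intra + inter


def candidate_pairs(partitions: List[List[T]], task: Tuple[int, int]) -> List[Tuple[T, T]]:
    """List the entity pairs a single match task would have to compare."""
    i, j = task
    if i == j:
        # Within one partition: every unordered pair exactly once.
        return list(combinations(partitions[i], 2))
    # Across two partitions: full Cartesian product.
    return [(a, b) for a in partitions[i] for b in partitions[j]]


if __name__ == "__main__":
    entities = [f"product-{n}" for n in range(10)]  # placeholder input data
    partitions = size_based_partitions(entities, max_size=4)
    tasks = generate_match_tasks(len(partitions))
    for task in tasks:
        print(task, len(candidate_pairs(partitions, task)), "comparisons")
```

Under these assumptions, p partitions yield p + p(p-1)/2 match tasks, and each task needs at most two partitions in memory. A smaller `max_size` therefore lowers the memory required per task but increases the number of tasks and the associated communication, which is the trade-off the abstract refers to.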

<a href="file/Kolb_QDB_2010.pptx" style="float:left; margin-left:20px;">
<img title="Presentation@QDB 2010" src="file/Kolb_QDB_2010.png" width="180" height="135" alt="Presentation" style="border:1px solid grey;"/>
</a>
<br style="clear:left;"/>

<h2>Keywords</h2>
<ul>
<li>Distributed computing, Parallelism</li>
<li>Entity Resolution, Object matching, Similarity Join, Pair-wise comparison</li>
<li>Blocking, Clustering</li>
</ul>

<h2 id="bibtex_heading">BibTeX</h2>
<pre id="bibtex_listing">
@inproceedings{data_partitioning_for_parallel_entity_matching,
author = {Toralf Kirsten and
Lars Kolb and
Michael Hartung and
Anika Gro{\ss} and
Hanna K{\"o}pcke and
Erhard Rahm},
title = {{Data Partitioning for Parallel Entity Matching}},
booktitle = {8th International Workshop on Quality in Databases},
year = {2010}
}
</pre>
