
FAst Multi-source Entity Resolution system (FAMER)

Overview

FAMER is a new scalable framework for distributed multi-source entity resolution. It can construct similarity graphs for entities from multiple sources based on different linking schemes; existing links from the Web of Data can also be used to build the similarity graph. FAMER also provides several entity clustering schemes. They use the similarity graph to determine groups of matching entities, aiming to maximize the similarity between entities within a cluster and to minimize the similarity between entities of different clusters. Moreover, FAMER is able to repair clusters, e.g. clusters that overlap and/or are source-inconsistent. We have also developed a visualization tool, SIMG-Viz, to visually analyze the similarity graphs and clusters determined by FAMER.
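The two cluster defects mentioned above can be made concrete with a small check. The following is a minimal sketch in plain Java, not FAMER code. It assumes that a cluster is source-inconsistent when it contains more than one entity from the same (duplicate-free) source, and that clusters overlap when an entity is assigned to more than one of them. The class, field and method names are hypothetical.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of FAMER: detects the defects a repair step has to fix.
final class ClusterChecks {
    // An entity is reduced to its id and the data source it came from.
    static final class Entity {
        final String id;
        final String source;

        Entity(String id, String source) {
            this.id = id;
            this.source = source;
        }
    }

    // True if the cluster holds two or more entities from the same source
    // (assumed definition of "source-inconsistent").
    static boolean isSourceInconsistent(Collection<Entity> cluster) {
        Set<String> seenSources = new HashSet<>();
        for (Entity e : cluster) {
            if (!seenSources.add(e.source)) {
                return true;
            }
        }
        return false;
    }

    // Ids of entities that appear in more than one cluster.
    static Set<String> overlappingEntities(List<? extends Collection<Entity>> clusters) {
        Set<String> seen = new HashSet<>();
        Set<String> overlapping = new TreeSet<>();
        for (Collection<Entity> cluster : clusters) {
            // Collect distinct ids first so a repeated id inside one cluster is not counted.
            Set<String> idsInCluster = new HashSet<>();
            for (Entity e : cluster) {
                idsInCluster.add(e.id);
            }
            for (String id : idsInCluster) {
                if (!seen.add(id)) {
                    overlapping.add(id);
                }
            }
        }
        return overlapping;
    }
}
```

A repair component would have to act on every cluster for which `isSourceInconsistent` returns true and on every id returned by `overlappingEntities`; how that is done is not shown here.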

FAMER is implemented using Apache Flink™ so that the calculation of similarity graphs and the clustering approaches can be executed in parallel on clusters of variable size. For the parallel clustering schemes we also use Flink's Gelly library, which supports vertex-centric programming of graph algorithms: a user-defined program is executed iteratively and in parallel over all vertices of a graph.
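To illustrate the vertex-centric idea, the sketch below computes connected components by letting every vertex repeatedly adopt the smallest label it receives from its neighbours. It is a single-machine simulation in plain Java under simplifying assumptions: it does not use Flink or Gelly, every vertex sends its label in every round, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Single-machine simulation of vertex-centric iteration; it does not use Flink or Gelly.
final class VertexCentricSketch {
    // Labels every vertex with the smallest vertex id in its connected component.
    // edges: undirected links given as {u, v} pairs; both ends must be in vertices.
    static Map<Integer, Integer> connectedComponents(Set<Integer> vertices, int[][] edges) {
        Map<Integer, List<Integer>> neighbours = new HashMap<>();
        Map<Integer, Integer> label = new TreeMap<>();
        for (int v : vertices) {
            neighbours.put(v, new ArrayList<>());
            label.put(v, v); // initial vertex state: its own id
        }
        for (int[] e : edges) {
            neighbours.get(e[0]).add(e[1]);
            neighbours.get(e[1]).add(e[0]);
        }

        boolean changed = true;
        while (changed) { // one pass of this loop corresponds to one superstep
            changed = false;
            // Messaging phase: every vertex sends its current label to its neighbours;
            // messages for the same target are combined by taking the minimum.
            Map<Integer, Integer> inbox = new HashMap<>();
            for (int v : vertices) {
                for (int n : neighbours.get(v)) {
                    inbox.merge(n, label.get(v), Integer::min);
                }
            }
            // Compute phase: the same update rule is applied to each vertex independently,
            // which is what makes the model easy to run in parallel.
            for (Map.Entry<Integer, Integer> message : inbox.entrySet()) {
                if (message.getValue() < label.get(message.getKey())) {
                    label.put(message.getKey(), message.getValue());
                    changed = true;
                }
            }
        }
        return label;
    }
}
```

The loop stops once a superstep changes no label. In a distributed engine the compute phase would run in parallel over partitions of the vertex set rather than in a sequential loop.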

Aims

  • Efficient parallel execution of match workflows in the cloud
  • Efficient application of clustering schemes for entity matching
  • Efficient methods for repairing entity matching results

Workflow

FAMER takes entities from multiple sources as input and outputs a set of clusters of matching entities. The first part, Linking, is a configurable component that generates a similarity graph in which similar entities are linked pairwise. This phase starts with blocking, e.g. Standard Blocking on a specific property, so that only entities of the same block need to be compared with each other. Pairwise matching is typically based on the combined similarity of several properties and a minimum similarity threshold. A future version of FAMER will also support learned models for binary match classification.

Currently we focus mostly on the second part of FAMER, which uses the similarity graph to determine entity clusters. The initial version supports six ER clustering approaches: connected components, correlation clustering (CCPivot), Center, Merge Center and two variations of star clustering that can lead to overlapping clusters. Later we proposed a new approach, CLIP (CLustering based on LInk Priority), which produces clusters with no overlap and no source inconsistency provided that all data sources are clean (no duplicates within any data source). We also augmented FAMER with a repair component, RLIP, that is able to resolve overlaps and source inconsistencies. You can find the Java code of both CLIP and RLIP here.
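The Linking phase described above can be sketched as follows. This is an illustration in plain Java, not FAMER's API. The blocking key (first letter of the name), the token-based Jaccard similarity, the 0.8/0.2 property weights and all class, field and method names are assumptions chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative linking step; names, blocking key and similarity function are assumptions.
final class LinkingSketch {
    static final class Entity {
        final String id, source, name, country;

        Entity(String id, String source, String name, String country) {
            this.id = id;
            this.source = source;
            this.name = name;
            this.country = country;
        }
    }

    // One edge of the similarity graph.
    static final class Link {
        final String left, right;
        final double similarity;

        Link(String left, String right, double similarity) {
            this.left = left;
            this.right = right;
            this.similarity = similarity;
        }
    }

    // Standard Blocking on one property: entities with the same key share a block.
    // Assumes non-empty names; the key is the lower-cased first letter.
    static Map<String, List<Entity>> block(List<Entity> entities) {
        Map<String, List<Entity>> blocks = new TreeMap<>();
        for (Entity e : entities) {
            String key = e.name.substring(0, 1).toLowerCase(Locale.ROOT);
            blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(e);
        }
        return blocks;
    }

    // Jaccard similarity of the lower-cased, whitespace-separated tokens of two strings.
    static double jaccard(String a, String b) {
        Set<String> x = new HashSet<>(Arrays.asList(a.toLowerCase(Locale.ROOT).split("\\s+")));
        Set<String> y = new HashSet<>(Arrays.asList(b.toLowerCase(Locale.ROOT).split("\\s+")));
        Set<String> intersection = new HashSet<>(x);
        intersection.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    // Combined similarity of several properties as a weighted average (weights are hypothetical).
    static double similarity(Entity a, Entity b) {
        return 0.8 * jaccard(a.name, b.name) + 0.2 * jaccard(a.country, b.country);
    }

    // Compares only entities of the same block and keeps pairs that reach the threshold.
    static List<Link> link(List<Entity> entities, double threshold) {
        List<Link> links = new ArrayList<>();
        for (List<Entity> block : block(entities).values()) {
            for (int i = 0; i < block.size(); i++) {
                for (int j = i + 1; j < block.size(); j++) {
                    double sim = similarity(block.get(i), block.get(j));
                    if (sim >= threshold) {
                        links.add(new Link(block.get(i).id, block.get(j).id, sim));
                    }
                }
            }
        }
        return links;
    }
}
```

The returned links form the edges of the similarity graph that the clustering schemes take as input. Blocking trades recall for speed: entities whose blocking keys differ are never compared, so the key should be chosen from a property that is reliably populated.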

Acknowledgements

This work is partially funded by the German Federal Ministry of Education and Research under project ScaDS Dresden/Leipzig (BMBF 01IS14014B).


Competence Center for Scalable Data Services and Solutions (ScaDS)

Project Members

Scientific / Student Assistants

Volodymyr Moroz

Publications

Saeedi, Alieh; Nentwig, Markus; Peukert, Eric; Rahm, Erhard
Scalable Matching and Clustering of Entities with FAMER
Complex Systems Informatics and Modeling Quarterly (CSIMQ), Issue 16, Sep./Oct. 2018, pp 61–83
2018-11

Saeedi, Alieh; Peukert, Eric; Rahm, Erhard
Using Link Features for Entity Clustering in Knowledge Graphs
Proc. ESWC 2018 (Best research paper award)
2018-06

Rostami, M. Ali; Saeedi, Alieh; Peukert, Eric; Rahm, Erhard
Interactive Visualization of Large Similarity Graphs and Entity Resolution Clusters
Proc. EDBT 2018
2018-03

Saeedi, Alieh; Peukert, Eric; Rahm, Erhard
Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution
Proc. ADBIS, LNCS 10509, pp 278–293
2017-09

Related Datasets

DS1: Geographical Settlements
DS2: MusicBrainz
DS3: North Carolina Voters