
Benchmark datasets for entity resolution

In the VLDB 2010 paper [1] we present a first comparative evaluation of the relative match quality and runtime efficiency of entity resolution approaches on challenging real-world match tasks. The evaluation considers existing approaches both with and without machine learning for finding a suitable parameterization and combination of similarity functions. In addition to approaches from the research community, a state-of-the-art commercial entity resolution implementation is considered. Our results indicate significant differences in quality and efficiency between the approaches. We also find that some challenging resolution tasks, such as matching product entities from online shops, are not sufficiently solved by conventional approaches based on the similarity of attribute values.

Datasets

Below you can download the datasets used in our evaluation. Each zip file contains three CSV files: two are the entity source files, and the third is the perfect mapping between them. Please refer to the papers for how we determined the perfect mapping.

Domain | Attributes | Sources | Used in
Bibliographic | title, authors, venue, year | DBLP-ACM | [1], [2], [3]
Bibliographic | title, authors, venue, year | DBLP-Scholar | [1], [2], [3]
E-commerce | name, description, manufacturer, price | Amazon-GoogleProducts | [1], [2]
E-commerce | name, description, manufacturer, price | Abt-Buy | [1], [2]
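
As an illustration of how the three files in one archive fit together, the following minimal Python sketch reads a perfect mapping from a zip file and scores a set of predicted matches against it using precision, recall and F-measure. The file names, column names and the UTF-8 default encoding are assumptions made for this example only; check the actual CSV headers and encoding after downloading. The demo archive is built in memory with made-up records and is not one of the datasets listed above.

```python
import csv
import io
import zipfile


def read_csv_from_zip(archive, member, encoding="utf-8"):
    """Return the rows of one CSV file inside an open ZipFile as dicts."""
    with archive.open(member) as raw:
        # ZipFile.open yields bytes, so wrap it for the csv module.
        text = io.TextIOWrapper(raw, encoding=encoding, newline="")
        return list(csv.DictReader(text))


def load_mapping(archive, member, left_col, right_col, encoding="utf-8"):
    """Read the perfect mapping as a set of (left_id, right_id) pairs."""
    rows = read_csv_from_zip(archive, member, encoding=encoding)
    return {(row[left_col], row[right_col]) for row in rows}


def match_quality(predicted, perfect):
    """Precision, recall and F-measure of predicted pairs vs. the perfect mapping."""
    true_positives = len(predicted & perfect)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(perfect) if perfect else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical demo archive: file and column names are invented for this sketch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("source_a.csv", "id,title\na1,Entity resolution\na2,Data cleaning\n")
    demo.writestr("source_b.csv", "id,title\nb1,Entity Resolution\nb2,Query processing\n")
    demo.writestr("perfect_mapping.csv", "id_a,id_b\na1,b1\n")

with zipfile.ZipFile(buffer) as archive:
    source_a = read_csv_from_zip(archive, "source_a.csv")
    source_b = read_csv_from_zip(archive, "source_b.csv")
    perfect = load_mapping(archive, "perfect_mapping.csv", "id_a", "id_b")

# Output of some matcher under test: one correct pair and one false positive.
predicted = {("a1", "b1"), ("a2", "b2")}
precision, recall, f_measure = match_quality(predicted, perfect)
print(f"precision={precision:.2f} recall={recall:.2f} f-measure={f_measure:.2f}")
```

Representing both the predicted matches and the perfect mapping as sets of ID pairs keeps the evaluation independent of how the matcher itself works, so the same scoring function can be reused for any approach.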

See also our dataset of affiliation strings.

Publications

Petermann, A.; Junghanns, M.; Müller, R.; Rahm, E.
FoodBroker - Generating Synthetic Datasets for Graph-Based Business Analytics
5th Workshop on Big Data Benchmarking (WBDB 2014), LNCS 8991, 2015
2014-08

Köpcke, H.; Thor, A.; Rahm, E.
Evaluation of entity resolution approaches on real-world match problems
Proc. 36th Intl. Conference on Very Large Databases (VLDB) / Proceedings of the VLDB Endowment 3(1), 2010
2010-09

Köpcke, H.; Thor, A.; Rahm, E.
Learning-based approaches for matching web data entities
IEEE Internet Computing 14(4), 2010
2010-07

Thor, A.; Rahm, E.
MOMA - A Mapping-based Object Matching System
Proc. 3rd Conference on Innovative Data Systems Research (CIDR), 2007
2007-01

Köpcke, H.; Rahm, E.
Frameworks for entity matching: A comparison
Data & Knowledge Engineering
2010-01

Köpcke, H.; Thor, A.; Rahm, E.
Comparative evaluation of entity resolution approaches with FEVER
Proc. 35th Intl. Conference on Very Large Databases (VLDB), 2009 (demo)
2009-08

Contact/Project Members