Benchmark datasets for entity resolution

Duration

2010-2019

Description

We offer several datasets for evaluating entity resolution that have been used in our own evaluations and are made available to other researchers. The first set of datasets has been used for pairwise matching of entities from two sources. The second set is also usable for entity clustering, mostly for more than two sources.

Datasets for Binary Entity Resolution:

In the VLDB 2010 paper [1] we present a first comparative evaluation of the relative match quality and runtime efficiency of entity resolution approaches using challenging real-world match tasks. The evaluation considers existing approaches both with and without machine learning for finding a suitable parameterization and combination of similarity functions. In addition to approaches from the research community, a state-of-the-art commercial entity resolution implementation is considered. Our results indicate significant quality and efficiency differences between the approaches. We also find that some challenging resolution tasks, such as matching product entities from online shops, are not sufficiently solved with conventional approaches based on the similarity of attribute values.

Below you can download the datasets used in our evaluation. Each zip-file contains three csv-files: two are the entity source files and the third is the perfect mapping. Please refer to the papers to see how we determined the perfect mapping.

Task/source files	Domain	Attributes	#entities	#matches	Used in
Abt-Buy	E-commerce	name, description, manufacturer, price	1081+1092	1097	[1], [2]
Amazon-GoogleProducts	E-commerce	name, description, manufacturer, price	1363+3226	1300	[1], [2]
DBLP-ACM	Bibliographic	title, authors, venue, year	2614+2294	2224	[0], [1], [2]
DBLP-Scholar	Bibliographic	title, authors, venue, year	2616+64263	5347	[0], [1], [2], [3]
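
To evaluate a matcher on one of these tasks, its predicted correspondences can be compared with the perfect mapping. The following is a minimal sketch of that comparison; the column names (`idA`, `idB`), the helper names, and the inline sample rows are assumptions made for illustration, so check the headers of the downloaded csv-files before reusing it.

```python
import csv
import io

def read_pairs(csv_text, left_col, right_col):
    """Read a mapping file into a set of (left_id, right_id) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row[left_col], row[right_col]) for row in reader}

def match_quality(predicted, perfect):
    """Return precision, recall and F1 of predicted pairs against the perfect mapping."""
    true_positives = len(predicted & perfect)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(perfect) if perfect else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical stand-ins for a perfect mapping file and a matcher's output.
PERFECT_CSV = "idA,idB\na1,b1\na2,b2\na3,b3\na4,b4\n"
PREDICTED_CSV = "idA,idB\na1,b1\na2,b2\na3,b9\n"

perfect = read_pairs(PERFECT_CSV, "idA", "idB")
predicted = read_pairs(PREDICTED_CSV, "idA", "idB")
precision, recall, f1 = match_quality(predicted, perfect)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

For a real task, the two inline strings would be replaced by the contents of the perfect mapping file and of the matcher's result file.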

January 2019: These datasets are made available by the database group of Prof. Erhard Rahm under the Creative Commons license. To give proper credit, please refer to this website and the VLDB 2010 paper [1].

Datasets for Entity Clustering:

Entity clustering is commonly used to determine matching entities within a single data source. It is also needed for matching entities from multiple (>2) sources, where all matching entities from the different sources are grouped into entity clusters. We provide one benchmark dataset for single-source entity clustering and several for multi-source entity clustering with more than two duplicate-free sources:

The affiliation dataset contains affiliation strings extracted from the PDFs of database publications that appeared between 2000 and 2009. The strings often contain multiple address components, such as the name of the institution/company and often also the city name.
Geographic Settlements contains geographical real-world entities from four different data sources (DBpedia, Geonames, Freebase, NYTimes) and has already been used in the OAEI competition.
The Music Brainz dataset is based on real records about songs from the MusicBrainz database but uses the DAPO data generator to create duplicates with modified attribute values. The generated dataset consists of five sources and contains duplicates for 50% of the original records in two to five sources. All duplicates are generated with a high degree of corruption to stress-test the ER and clustering approaches.
The North Carolina Voters dataset is based on real person records from the North Carolina voter registry and on duplicates generated synthetically with the tool GeCo. We consider two configurations with either 5 or 10 sources, each having 1 million entities; i.e., we process up to 10 million person records. Each source is duplicate-free, but 50% of the entities are replicated in all sources without any corruption. Moreover, 25% of the entities are corrupted and replicated in all sources, and the remaining 25% are corrupted but present in only some sources. For the generation of corrupted records we applied a moderate corruption rate of 20%, i.e., most attribute values remained unchanged.
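
Given pairwise match results, one simple way to form entity clusters is to take the connected components of the match graph. The sketch below is a baseline for illustration only, not the clustering method used in the cited papers; the entity IDs and the `cluster_matches` helper are hypothetical.

```python
def cluster_matches(entity_ids, match_pairs):
    """Group entities into clusters via connected components (union-find)."""
    parent = {e: e for e in entity_ids}

    def find(e):
        # Follow parent links to the root, shortening the path on the way.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    for a, b in match_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for e in entity_ids:
        clusters.setdefault(find(e), set()).add(e)
    # Sort members and clusters so the result is deterministic.
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical entity IDs of the form "<source>:<record>".
entities = ["s1:a", "s2:a", "s3:a", "s1:b", "s2:b", "s4:c"]
pairs = [("s1:a", "s2:a"), ("s2:a", "s3:a"), ("s1:b", "s2:b")]
clusters = cluster_matches(entities, pairs)
print(clusters)
```

Connected components ignore the fact that the sources are duplicate-free, so a chain of matches can place two entities from the same source in one cluster. This is one reason to evaluate more specialized clustering approaches on these datasets.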

Task/source files	Domain	Attributes	#sources	#entities	#matches	#clusters	Used in	More information
Affiliations	Bibliographic, geography	affiliation string	1	2,260	32,816	330	[9], [10]
Geographic Settlements	geography	name, longitude, latitude	4	3,054	4,391	820	[4], [5], [6], [7], [8]
Music Brainz 20K		artist, title, album, year, length	5	19,375	16,250	10,000	[4], [5], [6], [7], [8]	readMe
Music Brainz 200K			5	193,750	162,500	100,000
Music Brainz 2M			5	1,937,500	1,624,503	1,000,000	[7], [8]
Music Brainz 20M			5	19,375,000	16,250,000	10,000,000	[7]
North Carolina Voters 5M	Persons	name, surname, suburb, postcode	5	5,000,000	3,331,384	3,500,840	[4], [5], [6], [8]	readMe
North Carolina Voters 10M			10	10,000,000	14,995,973	6,625,848	[4], [6], [8]
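
The #matches and #clusters columns are related: a cluster with k entities implies k(k-1)/2 matching pairs. The following sketch computes that count for a list of cluster sizes. It assumes that #matches counts every pair of entities within a cluster, which should be checked against the readMe files; the cluster sizes used here are made up for illustration.

```python
from math import comb

def implied_matches(cluster_sizes):
    """Number of entity pairs that fall into the same cluster."""
    return sum(comb(k, 2) for k in cluster_sizes)

# Hypothetical cluster sizes: one singleton and clusters of 2, 3 and 5 entities.
sizes = [1, 2, 3, 5]
total_pairs = implied_matches(sizes)
print(total_pairs)
```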

June 2019: These datasets are made available by the database group of Prof. Erhard Rahm under the Creative Commons license. To give proper credit, please refer to this website and, for the multi-source datasets, to the ADBIS 2017 paper [4].

Publications (18)

Files	Cover	Description	Year
		KGpipe: Generation and Evaluation of Pipelines for Data Integration into Knowledge Graphs. Hofer, M.; Rahm, E. arXiv	2025 / 11
		Matching Entities from Multiple Sources with Hierarchical Agglomerative Clustering. Saeedi, A.; David, L.; Rahm, E. KEOD 2021	2021 / 10
		Extended Affinity Propagation Clustering for Multi-source Entity Resolution. Lerm, S.; Saeedi, A.; Rahm, E. BTW 2021	2021 / 3
		Incremental Multi-source Entity Resolution for Knowledge Graph Completion. Saeedi, A.; Peukert, E.; Rahm, E. Proc. ESWC	2020 / 6
		Knowledge Graph Completion with FAMER. Obraczka, D.; Saeedi, A.; Rahm, E. Proc. KDD workshop on Data Integration to Knowledge Graphs (DI2KG) (DI2KG Challenge Winner)	2019 / 8
		Incremental Clustering on Linked Data. Nentwig, M.; Rahm, E. Proc. IEEE International Conference on Data Mining Workshop, ICDMW 2018, Singapore	2018 / 11
		Scalable Matching and Clustering of Entities with FAMER. Saeedi, A.; Nentwig, M.; Peukert, E.; Rahm, E. Complex Systems Informatics and Modeling Quarterly (CSIMQ), Issue 16, Sep./Oct. 2018, pp 61–83	2018 / 11
		Using Link Features for Entity Clustering in Knowledge Graphs. Saeedi, A.; Peukert, E.; Rahm, E. Proc. ESWC 2018 (Best research paper award)	2018 / 6
		Interactive Visualization of Large Similarity Graphs and Entity Resolution Clusters. Rostami, M.; Saeedi, A.; Peukert, E.; Rahm, E. Proc. EDBT 2018	2018 / 3
		Distributed Holistic Clustering on Linked Data. Nentwig, M.; Groß, A.; Möller, M.; Rahm, E. Proc. OTM 2017 - Confederated International Conferences: CoopIS, C&TC, and ODBASE 2017, LNCS 10574, pp 371-382	2017 / 10

Database Group Leipzig

within the department of computer science
