Benchmark datasets for entity resolution

We offer several datasets for evaluating entity resolution that we have used in our own evaluations and that we make available to other researchers. The first set of datasets has been used for pairwise matching of entities from two sources. The second set is also usable for entity clustering, mostly for more than two sources.

Datasets for Binary Entity Resolution

In the VLDB 2010 paper [1] we present a first comparative evaluation of the relative match quality and runtime efficiency of entity resolution approaches using challenging real-world match tasks. The evaluation considers existing approaches both with and without machine learning for finding a suitable parameterization and combination of similarity functions. In addition to approaches from the research community, a state-of-the-art commercial entity resolution implementation is considered. Our results indicate significant quality and efficiency differences between the approaches. We also find that some challenging resolution tasks, such as matching product entities from online shops, are not sufficiently solved by conventional approaches based on the similarity of attribute values.

Below you can download the datasets used in our evaluation. Each zip file contains three CSV files: two are the entity source files and the third is the perfect mapping. Please refer to the papers to see how we determined the perfect mapping.

| Task/source files | Domain | Attributes | #entities | #matches | Used in |
|---|---|---|---|---|---|
| DBLP-ACM | Bibliographic | title, authors, venue, year | 2614+2294 | 2224 | [0], [1], [2] |
| DBLP-Scholar | Bibliographic | title, authors, venue, year | 2616+64263 | 5347 | [0], [1], [2], [3] |
| Amazon-GoogleProducts | E-commerce | name, description, manufacturer, price | 1363+3226 | 1300 | [1], [2] |
| Abt-Buy | E-commerce | name, description, manufacturer, price | 1081+1092 | 1097 | [1], [2] |
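
A match result for one of these tasks is typically scored by comparing the predicted pairs with the perfect mapping. The following is a minimal Python sketch of that comparison, not the evaluation code used in the papers. The column names `idA` and `idB`, the entity ids, and the in-memory sample standing in for a mapping file are all assumptions for illustration; check the header row of the downloaded CSV files for the actual names.

```python
import csv
import io


def load_pairs(handle, left_col, right_col):
    """Read a mapping CSV and return a set of (left_id, right_id) pairs."""
    reader = csv.DictReader(handle)
    return {(row[left_col], row[right_col]) for row in reader}


def evaluate(predicted, gold):
    """Return (precision, recall, F1) of predicted pairs against gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Tiny in-memory stand-in for a perfect-mapping file.
# The column names and ids are hypothetical.
gold_csv = io.StringIO("idA,idB\na1,b1\na2,b2\na3,b3\n")
gold = load_pairs(gold_csv, "idA", "idB")

# Hypothetical matcher output: two correct pairs and one false positive.
predicted = {("a1", "b1"), ("a2", "b2"), ("a4", "b9")}

precision, recall, f1 = evaluate(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

To score a real run, the `io.StringIO` sample would be replaced by an opened mapping file, for example `open(path, newline="", encoding="utf-8")` with whatever encoding the file actually uses.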

January 2019: These datasets are made available by the database group of Prof. Erhard Rahm under the Creative Commons license. To give proper credit, please refer to this website and the VLDB 2010 paper [1].

Datasets for Entity Clustering

Entity clustering is commonly used to determine matching entities within a single data source. It is also needed when matching entities from multiple (>2) sources, in order to group all matching entities from the different sources into entity clusters. We provide one benchmark dataset for single-source entity clustering and several for multi-source entity clustering with more than two duplicate-free sources:
  • The affiliation dataset contains affiliation strings extracted from the PDFs of database publications that appeared between 2000 and 2009. The strings often contain multiple address components such as the name of the institution/company and often also the city name.
  • Geographic Settlements contains geographical real-world entities from four different data sources (DBpedia, Geonames, Freebase, NYTimes) and has already been used in the OAEI competition.
  • The Music Brainz dataset is based on real records about songs from the MusicBrainz database but uses the DAPO data generator to create duplicates with modified attribute values. The generated dataset consists of five sources and contains duplicates for 50% of the original records in two to five sources. All duplicates are generated with a high degree of corruption to stress-test the ER and clustering approaches.
  • The North Carolina Voters dataset is based on real person records from the North-Carolina voter registry and synthetically generated duplicates using the tool GeCo. We consider two configurations with either 5 or 10 sources each having 1 million entities; i.e. we process up to 10 million person records. Each source is duplicate free, but 50% of the entities are replicated in all sources without any corruption. Moreover, 25% of entities are corrupted and replicated in all sources, and the remaining 25% are corrupted but present in only some sources. For the generation of corrupted records we applied a moderate corruption rate of 20%, i.e., most attribute values remained unchanged.
| Task/source files | Domain | Attributes | #sources | #entities | #matches | #clusters | Used in | More information |
|---|---|---|---|---|---|---|---|---|
| Affiliations | Bibliographic, geography | affiliation string | 1 | 2,260 | 32,816 | 330 | [9], [10] | |
| Geographic Settlements | geography | name, longitude, latitude | 4 | 3,054 | 4,391 | 820 | [4], [5], [6], [7], [8] | |
| Music Brainz 20K | Music | artist, title, album, year, length | 5 | 19,375 | 16,250 | 10,000 | [4], [5], [6], [7], [8] | readMe |
| Music Brainz 200K | | | 5 | 193,750 | 162,500 | 100,000 | | |
| Music Brainz 2M | | | 5 | 1,937,500 | 1,624,503 | 1,000,000 | [7], [8] | |
| Music Brainz 20M | | | 5 | 19,375,000 | 16,250,000 | 10,000,000 | [7] | |
| North Carolina Voters 5M | Persons | name, surname, suburb, postcode | 5 | 5,000,000 | 3,331,384 | 3,500,840 | [4], [5], [6], [8] | readMe |
| North Carolina Voters 10M | | | 10 | 10,000,000 | 14,995,973 | 6,625,848 | [4], [6], [8] | |
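
The relationship between pairwise matches and entity clusters can be illustrated with a short Python sketch. It groups entities by taking the connected components of the match pairs with a union-find structure, which is only one simple clustering scheme and is not claimed to be the method used in the cited papers. The entity ids such as `s1:1` are hypothetical.

```python
from collections import defaultdict


def cluster_matches(pairs):
    """Group entities into clusters via connected components (union-find)."""
    parent = {}

    def find(x):
        # Register unseen entities as their own root, then walk to the root
        # while halving the path to keep later lookups short.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for entity in list(parent):
        clusters[find(entity)].add(entity)
    return list(clusters.values())


def pairs_in_clusters(clusters):
    """Count the match pairs implied by clusters: k*(k-1)/2 per cluster of size k."""
    return sum(len(c) * (len(c) - 1) // 2 for c in clusters)


# Hypothetical match pairs across three sources s1, s2 and s3.
matches = [("s1:1", "s2:7"), ("s2:7", "s3:4"), ("s1:2", "s3:9")]
clusters = cluster_matches(matches)

print(sorted(len(c) for c in clusters))
print(pairs_in_clusters(clusters))
```

Connected components follow every chain of matches, so a single wrong match can merge two clusters and place several entities from the same duplicate-free source in one cluster. A scheme intended for the multi-source datasets would need to guard against that.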

June 2019: These datasets are made available by the database group of Prof. Erhard Rahm under the Creative Commons license. To give proper credit, please refer to this website and, for the multi-source datasets, to the ADBIS 2017 paper [4].

Publications

Nentwig, Markus; Groß, Anika; Möller, Maximilian; Rahm, Erhard
Distributed Holistic Clustering on Linked Data
Proc. OTM 2017 - Confederated International Conferences: CoopIS, C&TC, and ODBASE 2017, LNCS 10574, pp 371-382
2017-10

Petermann, A.; Junghanns, M.; Müller, R.; Rahm, E.
FoodBroker - Generating Synthetic Datasets for Graph-Based Business Analytics
5th Workshop on Big Data Benchmarking (WBDB 2014), LNCS 8991, 2015
2014-08

Aumueller, D.; Rahm, E.
Affiliation analysis of database publications
ACM SIGMOD Record, Vol. 40, No. 1, pp 26-31, March 2011
2011-03-31

Köpcke, H.; Thor, A.; Rahm, E.
Evaluation of entity resolution approaches on real-world match problems
Proc. 36th Intl. Conference on Very Large Databases (VLDB) / Proceedings of the VLDB Endowment 3(1), 2010
2010-09

Köpcke, H.; Thor, A.; Rahm, E.
Learning-based approaches for matching web data entities
IEEE Internet Computing 14(4), 2010
2010-07

Aumueller, David; Rahm, Erhard
Web-based Affiliation Matching
14th International Conference on Information Quality 2009 (ICIQ’09)
2009-11

Thor, A.; Rahm, E.
MOMA - A Mapping-based Object Matching System
Proc. 3rd Conference on Innovative Data Systems Research (CIDR), 2007
2007-01

Köpcke, H.; Rahm, E.
Frameworks for entity matching: A comparison
Data & Knowledge Engineering
2010-01

Köpcke, H.; Thor, A.; Rahm, E.
Comparative evaluation of entity resolution approaches with FEVER
Proc. 35th Intl. Conference on Very Large Databases (VLDB), 2009 (demo)
2009-08

Nentwig, Markus; Rahm, Erhard
Incremental Clustering on Linked Data
Proc. IEEE International Conference on Data Mining Workshop, ICDMW 2018, Singapore
2018-11

Saeedi, Alieh; Nentwig, Markus; Peukert, Eric; Rahm, Erhard
Scalable Matching and Clustering of Entities with FAMER
Complex Systems Informatics and Modeling Quarterly (CSIMQ), Issue 16, Sep./Oct. 2018, pp 61–83
2018-11

Saeedi, Alieh; Peukert, Eric; Rahm, Erhard
Using Link Features for Entity Clustering in Knowledge Graphs
Proc. ESWC 2018 (Best research paper award)
2018-06

Rostami, M. Ali; Saeedi, Alieh; Peukert, Eric; Rahm, Erhard
Interactive Visualization of Large Similarity Graphs and Entity Resolution Clusters
Proc. EDBT 2018
2018-03

Saeedi, Alieh; Peukert, Eric; Rahm, Erhard
Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution
Proc. ADBIS, LNCS 10509, pp 278-293
2017-09