Benchmark datasets for entity resolution:
We offer several datasets for evaluating entity resolution. They have been used in our own evaluations and are available to other researchers. The first set of datasets has been used for pairwise matching of entities from two sources. The second set can also be used for entity clustering, mostly with more than two sources.
Datasets for Binary Entity Resolution:
In the VLDB 2010 paper [1] we present a first comparative evaluation of the relative match quality and runtime efficiency of entity resolution approaches on challenging real-world match tasks. The evaluation considers existing approaches both with and without machine learning for finding a suitable parameterization and combination of similarity functions. In addition to approaches from the research community, a state-of-the-art commercial entity resolution implementation is considered. Our results indicate significant quality and efficiency differences between the approaches. We also find that some challenging resolution tasks, such as matching product entities from online shops, are not sufficiently solved by conventional approaches based on the similarity of attribute values.
Below you can download the datasets used in our evaluation. Each zip file contains three CSV files: two are the entity source files and the third is the perfect mapping. Please refer to the papers to see how we determined the perfect mapping.
Task/source files | Domain | Attributes | #entities | #matches | Used in |
---|---|---|---|---|---|
Abt-Buy | E-commerce | name, description, manufacturer, price | 1081+1092 | 1097 | [1], [2] |
Amazon-GoogleProducts | E-commerce | name, description, manufacturer, price | 1363+3226 | 1300 | [1], [2] |
DBLP-ACM | Bibliographic | title, authors, venue, year | 2614+2294 | 2224 | [0], [1], [2] |
DBLP-Scholar | Bibliographic | title, authors, venue, year | 2616+64263 | 5347 | [0], [1], [2], [3] |
January 2019: These datasets are made available by the database group of Prof. Erhard Rahm under the Creative Commons license. To give proper credit, please refer to this website and the VLDB 2010 paper [1].
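Since each download pairs two source files with a perfect mapping, a match result can be scored by comparing its predicted pairs with that mapping. The following is a minimal sketch of such a comparison, not the evaluation code used in the papers. The column names `idA` and `idB`, the helper names, and the tiny in-memory CSV are hypothetical stand-ins; check the header of the actual mapping file before adapting it.

```python
import csv
import io


def load_pairs(csv_file, left_col, right_col):
    """Read a mapping CSV into a set of (left_id, right_id) pairs."""
    reader = csv.DictReader(csv_file)
    return {(row[left_col], row[right_col]) for row in reader}


def match_quality(predicted, gold):
    """Return (precision, recall, f1) of predicted pairs against the gold mapping."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical stand-in for a perfect-mapping file (column names are assumed).
gold_csv = io.StringIO("idA,idB\na1,b1\na2,b2\na3,b3\n")
gold = load_pairs(gold_csv, "idA", "idB")

# Hypothetical output of some matcher: two correct pairs and one wrong pair.
predicted = {("a1", "b1"), ("a2", "b2"), ("a4", "b9")}

precision, recall, f1 = match_quality(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

For a real task, replace the `io.StringIO` object with an opened mapping file, for example `open(path, newline="")`, and pass the column names found in its header.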
Datasets for Entity Clustering:
Entity clustering is commonly used to determine matching entities within a single data source. It is also needed for matching entities from multiple (>2) sources to group all matching entities from different sources within entity clusters. We provide one benchmark dataset for single-source entity clustering and several for multi-source entity clustering with more than two duplicate-free sources:
- The affiliation dataset contains affiliation strings extracted from the PDFs of database publications that appeared between 2000 and 2009. The strings often contain multiple address components such as the name of the institution/company and often also the city name.
- Geographic Settlements contains geographical real-world entities from four different data sources (DBpedia, Geonames, Freebase, NYTimes) and has already been used in the OAEI competition.
- The Music Brainz dataset is based on real records about songs from the MusicBrainz database but uses the DAPO data generator to create duplicates with modified attribute values. The generated dataset consists of five sources and contains duplicates for 50% of the original records in two to five sources. All duplicates are generated with a high degree of corruption to stress-test the ER and clustering approaches.
- The North Carolina Voters dataset is based on real person records from the North Carolina voter registry and synthetically generated duplicates using the tool GeCo. We consider two configurations with either 5 or 10 sources, each having 1 million entities; i.e., we process up to 10 million person records. Each source is duplicate-free, but 50% of the entities are replicated in all sources without any corruption. Moreover, 25% of entities are corrupted and replicated in all sources, and the remaining 25% are corrupted but present in only some sources. For the generation of corrupted records we applied a moderate corruption rate of 20%, i.e., most attribute values remained unchanged.
Task/source files | Domain | Attributes | #sources | #entities | #matches | #clusters | Used in | More information |
---|---|---|---|---|---|---|---|---|
Affiliations | Bibliographic, geography | affiliation string | 1 | 2,260 | 32,816 | 330 | [9], [10] | |
Geographic Settlements | geography | name, longitude, latitude | 4 | 3,054 | 4,391 | 820 | [4], [5], [6], [7], [8] | |
Music Brainz 20K | | artist, title, album, year, length | 5 | 19,375 | 16,250 | 10,000 | [4], [5], [6], [7], [8] | readMe |
Music Brainz 200K | | artist, title, album, year, length | 5 | 193,750 | 162,500 | 100,000 | | |
Music Brainz 2M | | artist, title, album, year, length | 5 | 1,937,500 | 1,624,503 | 1,000,000 | [7], [8] | |
Music Brainz 20M | | artist, title, album, year, length | 5 | 19,375,000 | 16,250,000 | 10,000,000 | [7] | |
North Carolina Voters 5M | Persons | name, surname, suburb, postcode | 5 | 5,000,000 | 3,331,384 | 3,500,840 | [4], [5], [6], [8] | readMe |
North Carolina Voters 10M | Persons | name, surname, suburb, postcode | 10 | 10,000,000 | 14,995,973 | 6,625,848 | [4], [6], [8] | |
June 2019: These datasets are made available by the database group of Prof. Erhard Rahm under the Creative Commons license. To give proper credit, please refer to this website and, for the multi-source datasets, to the ADBIS 2017 paper [4].
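To illustrate how pairwise matches relate to entity clusters, the sketch below groups entities by taking the connected components of the match graph, using a small union-find structure. This is only one simple way to derive clusters and is not claimed to be the clustering approach of the cited papers. The entity identifiers, the match links, and the function names are hypothetical.

```python
from collections import defaultdict


def cluster_matches(entity_ids, match_pairs):
    """Group entities into clusters as connected components of the match graph."""
    parent = {entity: entity for entity in entity_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in match_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for entity in entity_ids:
        clusters[find(entity)].add(entity)
    return list(clusters.values())


def implied_match_count(clusters):
    """Count entity pairs inside clusters: k * (k - 1) / 2 for a cluster of size k."""
    return sum(len(c) * (len(c) - 1) // 2 for c in clusters)


# Hypothetical entities from four sources (prefix = source, suffix = real-world object).
entities = ["s1:a", "s2:a", "s3:a", "s1:b", "s2:b", "s4:c"]
links = [("s1:a", "s2:a"), ("s2:a", "s3:a"), ("s1:b", "s2:b")]

clusters = cluster_matches(entities, links)
print(sorted(len(c) for c in clusters))
print(implied_match_count(clusters))
```

In this toy input, three links produce three clusters, one of them a singleton, and the clusters imply four matching pairs because the pair ("s1:a", "s3:a") follows by transitivity.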
Publications (17)
Year / Month | Description |
---|---|
2021 / 3 | |
2021 / 10 | |
2020 / 6 | |
2019 / 8 | Obraczka, D.; Saeedi, A.; Rahm, E. Proc. KDD workshop on Data Integration to Knowledge Graphs (DI2KG) (DI2KG Challenge Winner) |
2018 / 3 | |
2018 / 6 | |
2018 / 11 | Nentwig, M.; Rahm, E. Proc. IEEE International Conference on Data Mining Workshop, ICDMW 2018, Singapore |
2018 / 11 | Saeedi, A.; Nentwig, M.; Peukert, E.; Rahm, E. Complex Systems Informatics and Modeling Quarterly (CSIMQ), Issue 16, Sep./Oct. 2018, pp 61–83 |
2017 / 10 | Nentwig, M.; Groß, A.; Möller, M.; Rahm, E. Proc. OTM 2017 - Confederated International Conferences: CoopIS, C&TC, and ODBASE 2017, LNCS 10574, pp 371-382 |
2017 / 9 | |