Köpcke, H.

Object Matching on Real-world Problems

Dissertation, Universität Leipzig

2014 / 05

Dissertation

Further information: https://katalog.ub.uni-leipzig.de/Record/0013004879

Abstract

Object matching (also referred to as duplicate identification, record linkage, entity resolution or reference reconciliation) is a crucial task for data integration and data cleaning. The task is to detect multiple representations of the same real-world object. This is a challenging task particularly for objects that are highly heterogeneous and of limited data quality, e.g., regarding completeness and consistency of their descriptions.

To gain a better overview of the current state of the art in object matching, we survey the existing frameworks and their evaluations. We review various frameworks published in the literature against a defined set of criteria, characterize them in some detail, and compare them with each other and with our own framework, FEVER.

With FEVER we introduce a new generic and comprehensive framework for object matching and comparative object matching evaluation. FEVER offers numerous operators for constructing non-learning as well as learning-based match workflows. Moreover, FEVER allows match approaches to be automatically executed and evaluated under different parameter configurations. FEVER thus provides a platform for conducting a comparative evaluation of the relative effectiveness and efficiency of alternative match approaches.
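The idea of executing a match approach under several parameter configurations and evaluating each one against a gold standard can be illustrated with a minimal sketch. It is not FEVER's API: the function names (`similarity`, `match`, `evaluate`, `sweep`), the single string attribute per record, and the use of `difflib.SequenceMatcher` as the similarity function are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match(source_a: dict, source_b: dict, threshold: float) -> set:
    """Non-learning match approach: compare all pairs, keep those above the threshold."""
    return {
        (id_a, id_b)
        for (id_a, val_a), (id_b, val_b) in product(source_a.items(), source_b.items())
        if similarity(val_a, val_b) >= threshold
    }


def evaluate(predicted: set, gold: set) -> tuple:
    """Return (precision, recall, F1) of predicted match pairs against a gold standard."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def sweep(source_a: dict, source_b: dict, gold: set, thresholds: list) -> tuple:
    """Run the approach once per threshold and report the configuration with the best F1."""
    results = {t: evaluate(match(source_a, source_b, t), gold) for t in thresholds}
    best = max(results, key=lambda t: results[t][2])
    return best, results
```

Calling `sweep(source_a, source_b, gold, [0.6, 0.7, 0.8, 0.9])` with two dictionaries mapping record ids to attribute values returns the threshold with the highest F-measure together with the quality figures for every configuration. The all-pairs comparison is quadratic and is only meant to keep the sketch short.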

Despite the large amount of recent research on object matching, no such evaluation has yet been conducted. With FEVER we fill this gap and present an evaluation of existing implementations on challenging real-world match tasks. We use the FEVER framework to automatically execute the approaches and to find favourable parameter settings in a comparable way. We consider approaches both with and without machine learning for finding suitable parameterizations and combinations of similarity functions. In addition to approaches from the research community, we also consider a state-of-the-art commercial object matching implementation. Our results indicate significant quality and efficiency differences between the approaches. We also find that some challenging matching tasks, such as matching product offers from online shops, are not sufficiently solved by conventional approaches based on the similarity of attribute values.

Furthermore, this thesis addresses the product offer matching problem. Product offer matching is a special case of object matching that is needed to identify equivalent offers referring to the same real-world product. The thesis proposes a tailored overall approach for this problem. The approach supports category-specific match strategies based on two pillars: comprehensive preprocessing and machine learning. The preprocessing extracts and cleans new attributes usable for matching. In particular, the approach extracts and uses so-called product codes to identify products and distinguish them from similar product variations. After preprocessing, machine learning is employed to semi-automatically determine a match strategy utilizing several attributes and similarity functions.
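The role of product codes can be illustrated with a small sketch. The abstract does not describe the actual extraction method, so the heuristic below is purely an assumption: a token is treated as a product code if it mixes letters and digits and has at least five characters after hyphens and slashes are removed. The names `extract_product_codes` and `same_product` are hypothetical.

```python
import re

# Tokens made of letters and digits, optionally joined by '-' or '/' (assumed format).
TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9\-/]*[A-Za-z0-9]")


def extract_product_codes(title: str, min_length: int = 5) -> set:
    """Return normalized code candidates found in an offer title (heuristic)."""
    codes = set()
    for token in TOKEN.findall(title):
        # Normalize so that 'SX-230HS' and 'sx230hs' compare equal.
        normalized = re.sub(r"[-/]", "", token).upper()
        has_digit = any(c.isdigit() for c in normalized)
        has_alpha = any(c.isalpha() for c in normalized)
        if len(normalized) >= min_length and has_digit and has_alpha:
            codes.add(normalized)
    return codes


def same_product(title_a: str, title_b: str) -> bool:
    """Treat two offers as a match candidate if they share at least one code."""
    return bool(extract_product_codes(title_a) & extract_product_codes(title_b))
```

Under these assumptions, `same_product("Canon PowerShot SX-230HS 12MP black", "canon sx230hs digital camera")` holds, while offers for the variations `SX230HS` and `SX220HS` are kept apart even though their titles are highly similar as strings. In the approach described above, such an extracted attribute would be one input among several to a learned match strategy rather than a decision rule on its own.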

<h2 id="bibtex_heading">BibTeX</h2>
<pre id="bibtex_listing">
@PHDTHESIS{koepcke_phdthesis2014,
author = {Hanna Köpcke},
title = {Object Matching on Real-world Problems},
school = {Universität Leipzig},
year = {2014}
}