Kolb, L. ; Rahm, E.

Parallel Entity Resolution with Dedoop

Datenbank-Spektrum 13 (1), 2013

2013 / 02

Other

Further information: http://dx.doi.org/10.1007/s13222-012-0110-x

Abstract

<p style="text-align:justify;">
We provide an overview of Dedoop (<u>De</u>duplication with Ha<u>doop</u>), a new tool for parallel entity resolution (ER) on cloud infrastructures. Dedoop supports a browser-based specification of complex ER strategies and provides a large library of blocking and matching approaches. To simplify the configuration of ER strategies with several similarity metrics, training-based machine learning approaches can be employed with Dedoop. Specified ER strategies are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. For improved performance, Dedoop supports redundancy-free multi-pass blocking as well as advanced load balancing approaches. To illustrate the usefulness of Dedoop, we present the results of a comparative evaluation of different ER strategies on a challenging real-world dataset.<br/><br/>Please visit our <a href="/dedoop#dedoop">project website</a> for further information about Dedoop.
</p>
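
<p>
As a rough illustration of the blocking-then-matching workflow the abstract describes, the following single-process Python sketch imitates a map phase that groups records by a blocking key and a reduce phase that compares record pairs within each block. It is not Dedoop's implementation: the function names, the prefix-based blocking key, the similarity measure (<code>difflib.SequenceMatcher</code>), the 0.8 threshold, and the sample records are all assumptions chosen for the example.
</p>

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three characters of the lowercased title.
    return record["title"].lower()[:3]


def map_phase(records):
    # Map: assign each record to its blocking key. Grouping by key here
    # stands in for the shuffle step of a MapReduce job.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return blocks


def reduce_phase(blocks, threshold=0.8):
    # Reduce: compare every pair of records inside a block and keep the
    # pairs whose title similarity reaches the (assumed) threshold.
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            similarity = SequenceMatcher(
                None, a["title"].lower(), b["title"].lower()
            ).ratio()
            if similarity >= threshold:
                matches.append((a["id"], b["id"]))
    return matches


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "title": "Canon PowerShot A2300"},
    {"id": 2, "title": "Canon Powershot A-2300"},
    {"id": 3, "title": "Nikon Coolpix S3300"},
]

matches = reduce_phase(map_phase(records))
print(matches)
```

<p>
Because pairs are compared only within a block, one very large block can dominate the runtime of a single reduce task, which is the kind of data skew that load balancing approaches aim to mitigate.
</p>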

<h2>Keywords</h2>
<ul>
<li>MapReduce</li>
<li>Hadoop</li>
<li>Entity Resolution</li>
<li>Blocking</li>
<li>Data Skew</li>
<li>Load Balancing</li>
</ul>

<h2 id="bibtex_heading">BibTeX</h2>
<pre id="bibtex_listing">
@article{DBLP:journals/dbsk/KolbR13,
author = {Lars Kolb and Erhard Rahm},
title = {{Parallel Entity Resolution with Dedoop}},
journal = {Datenbank-Spektrum},
volume = {13},
number = {1},
year = {2013},
pages = {23--32},
ee = {http://dx.doi.org/10.1007/s13222-012-0110-x},
bibsource = {DBLP, http://dblp.uni-trier.de}
}
</pre>