Kolb, L. ; Köpcke, H. ; Thor, A. ; Rahm, E.

Learning-based Entity Resolution with MapReduce

Proc. 3rd Intl. Workshop on Cloud Data Management (CloudDB), 2011

2011 / 10



Entity resolution is a crucial step for data quality and data integration. Learning-based approaches achieve high effectiveness, but at the expense of efficiency. To reduce the typically high execution times, we investigate how learning-based entity resolution can be realized in a cloud infrastructure using MapReduce. We propose and evaluate two efficient MapReduce-based strategies for pair-wise similarity computation and classifier application on the Cartesian product of two input sources. Our evaluation is based on real-world datasets and shows the high efficiency and effectiveness of the proposed approaches.
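To make the idea of distributing pair-wise comparison over a Cartesian product more concrete, here is a minimal single-process sketch in Python. It is not one of the two strategies from the paper, which the abstract does not describe. It assumes a generic block-partitioning scheme: each source is split into blocks, every block pair `(i, j)` becomes one reduce group, and each group compares its records and applies a classifier. All names (`map_phase`, `reduce_phase`, `resolve`), the string similarity from `difflib`, and the threshold rule standing in for a trained classifier are illustrative assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import product


def map_phase(records, source, num_r_blocks, num_s_blocks):
    """Emit each record once for every block pair (i, j) it belongs to.

    Assumption: records are assigned to blocks round-robin by position.
    A record of R in block i is replicated to all keys (i, 0..n-1);
    a record of S in block j is replicated to all keys (0..m-1, j).
    """
    for idx, rec in enumerate(records):
        if source == "R":
            i = idx % num_r_blocks
            for j in range(num_s_blocks):
                yield (i, j), ("R", rec)
        else:
            j = idx % num_s_blocks
            for i in range(num_r_blocks):
                yield (i, j), ("S", rec)


def similarity(a, b):
    """Hypothetical similarity measure: case-insensitive string ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(features, threshold):
    """Stand-in for a trained classifier: a threshold on one feature."""
    return features[0] >= threshold


def reduce_phase(values, threshold):
    """Compare every R record with every S record inside one block pair."""
    r_side = [rec for tag, rec in values if tag == "R"]
    s_side = [rec for tag, rec in values if tag == "S"]
    for r, s in product(r_side, s_side):
        sim = similarity(r, s)
        if classify([sim], threshold):
            yield r, s, sim


def resolve(source_r, source_s, num_r_blocks=2, num_s_blocks=2, threshold=0.8):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for key, value in map_phase(source_r, "R", num_r_blocks, num_s_blocks):
        groups[key].append(value)
    for key, value in map_phase(source_s, "S", num_r_blocks, num_s_blocks):
        groups[key].append(value)
    matches = []
    for key in sorted(groups):
        matches.extend(reduce_phase(groups[key], threshold))
    return matches


if __name__ == "__main__":
    products_r = ["Canon EOS 500D", "Nikon D90"]
    products_s = ["canon eos 500d", "Sony Alpha 200", "Nikon D-90"]
    for r, s, sim in resolve(products_r, products_s):
        print(f"{r!r} matches {s!r} (similarity {sim:.2f})")
```

Because every record of one source lands in exactly one block, each pair of the Cartesian product is compared in exactly one reduce group. The price is replication: each record is sent to as many groups as the other source has blocks, so the block counts trade parallelism against data transfer.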

<ul>
<li>MapReduce, Hadoop</li>
<li>Entity Resolution, Object matching, Similarity Join, Pair-wise comparison</li>
<li>Cartesian product</li>
<li>Machine Learning, Classification</li>
</ul>

