Nentwig, M.; Rahm, E.

Incremental Clustering on Linked Data

Proc. IEEE International Conference on Data Mining Workshop, ICDMW 2018, Singapore

2018 / 11



Data integration in the Web of Data is not limited to the pairwise linking of entities but often requires clustering entities from different sources, e.g., within knowledge graphs. Such entity clustering should not only scale to large data volumes and many sources but also be dynamic, so that it can deal with continuously changing sources and incorporate new ones. Previous entity clustering approaches are mostly static, focusing on the one-time linking and clustering of entities from a few sources. In this paper, we propose and evaluate new scalable approaches for incremental entity clustering that support the continuous addition of new entities and data sources. The implementation is based on the distributed processing framework Apache Flink. A detailed performance evaluation with real and synthetically customized datasets shows the effectiveness and scalability of the incremental clustering approaches.