Restricting the Overlap of Top-N Sets in Schema Matching
Proc. EDBT workshop on New Trends in Similarity Search (NTSS 2011)
Further information: http://dl.acm.org/authorize?481097=
Computing similarities between metadata elements is an essential process in schema and ontology matching systems. Such systems aim at reducing the manual effort of finding mappings for data integration or ontology alignment. Similarity measures compute syntactic, semantic or structural similarities of metadata elements. Typically, different similarities are combined and the most similar element pairs are selected to produce a best-1 mapping suggestion.
Unfortunately, automatic schema matching systems are rarely adopted commercially, since correcting the best-1 mapping suggestion is often harder than finding the mapping manually. To alleviate this, schema matching must be used incrementally by computing Top-N mapping suggestions from which the user can select. However, current similarity measures and selection operators suggest the same target elements for many different source elements. This effect, which we call overlap, significantly reduces the quality of schema matching.
To address this problem, we first introduce a new weighted token similarity measure that implicitly decreases the overlap between Top-N sets. Second, we introduce a new Top-N selection operator that increases recall by restricting overlap directly. We evaluate our approaches on large, real-world matching problems and show the positive effect on match quality.
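To make the notion of overlap concrete, the following minimal Python sketch computes plain Top-N sets from a toy similarity matrix, counts how often each target element recurs across those sets, and then applies a simple greedy cap on reuse. It is an illustration under stated assumptions, not the selection operator proposed in the paper: the function names (`top_n`, `overlap`, `top_n_capped`), the `max_uses` parameter, and the example similarity values are all hypothetical.

```python
from collections import Counter

def top_n(sim, n):
    """For each source element, return its n most similar target elements."""
    return {
        src: [t for t, _ in sorted(row.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for src, row in sim.items()
    }

def overlap(top_sets):
    """Count in how many Top-N sets each target element appears."""
    return Counter(t for targets in top_sets.values() for t in targets)

def top_n_capped(sim, n, max_uses):
    """Greedy Top-N selection (illustrative assumption, not the paper's operator):
    each target element may appear in at most max_uses Top-N sets."""
    # Visit all (similarity, source, target) triples from most to least similar.
    pairs = sorted(
        ((s, src, tgt) for src, row in sim.items() for tgt, s in row.items()),
        key=lambda p: (-p[0], p[1], p[2]),
    )
    result = {src: [] for src in sim}
    uses = Counter()
    for _, src, tgt in pairs:
        if len(result[src]) < n and uses[tgt] < max_uses:
            result[src].append(tgt)
            uses[tgt] += 1
    return result

# Hypothetical similarity values between source and target elements.
sim = {
    "cust_name":  {"name": 0.9, "id": 0.6, "customer": 0.5},
    "cust_id":    {"name": 0.5, "id": 0.9, "customer": 0.4},
    "order_name": {"name": 0.8, "id": 0.6, "order": 0.5},
}

plain = top_n(sim, 2)
capped = top_n_capped(sim, 2, max_uses=2)
print(overlap(plain))   # "name" and "id" each appear in all three Top-2 sets
print(capped)           # less frequent targets such as "order" now surface
```

In this toy input, the unrestricted Top-2 sets suggest only `name` and `id` for every source element, whereas the capped variant leaves room for `customer` and `order`. A greedy cap is only one conceivable way to restrict overlap and is chosen here for brevity.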