MOMA - A Mapping-based Object Matching System
Object matching, or object consolidation, is a crucial task for data integration and data cleaning. It addresses the problem of identifying object instances in data sources that refer to the same real-world entity. We propose a flexible framework called MOMA for mapping-based object matching. It allows the construction of match workflows that combine the results of several matcher algorithms on both attribute values and contextual information. The output of a match task is an instance-level mapping that supports information fusion in P2P data integration systems and can be re-used for other match tasks. MOMA also utilizes semantic mappings of different cardinalities and provides merge and compose operators for mapping combination. We propose and evaluate several strategies both for object matching between different sources and for duplicate identification within a single data source.
The figure above illustrates the components of the MOMA architecture as well as the process of object matching. A mapping repository is used to materialize both association mappings and same-mappings. Given the simple structure of our mappings, they can be maintained efficiently in relational mapping tables. Many mappings already exist in data sources and can thus be utilized for object matching. For instance, DBLP data on publication lists for venues and for authors are kept as association mappings. Similarly, some same-mappings already exist, e.g., Google Scholar links its publications to ACM. MOMA also maintains a mapping cache for storing intermediate same-mappings derived during a match workflow. Hence, MOMA not only processes the input instances but also utilizes the mappings of the repository and the cache for object matching.
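As a rough illustration of how such mappings could be kept in a relational mapping table, the following sketch uses an in-memory SQLite database. The table layout, the column names, the mapping names and the instance identifiers are all assumptions made for this example and are not taken from the MOMA implementation.

```python
import sqlite3

# Hypothetical schema: one row per correspondence of a named mapping.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE mapping (
        mapping_name TEXT NOT NULL,
        kind         TEXT NOT NULL CHECK (kind IN ('same', 'association')),
        domain_id    TEXT NOT NULL,
        range_id     TEXT NOT NULL,
        similarity   REAL NOT NULL DEFAULT 1.0,
        PRIMARY KEY (mapping_name, domain_id, range_id)
    )
""")

# Made-up sample rows: an association mapping (venue -> publication)
# and a same-mapping between two sources.
rows = [
    ("venue_publication", "association", "venue1", "pub1", 1.0),
    ("venue_publication", "association", "venue1", "pub2", 1.0),
    ("gs_acm", "same", "gs1", "acm1", 0.95),
]
conn.executemany("INSERT INTO mapping VALUES (?, ?, ?, ?, ?)", rows)


def load_mapping(connection, name):
    """Load one named mapping as a dict {(domain_id, range_id): similarity}."""
    cursor = connection.execute(
        "SELECT domain_id, range_id, similarity FROM mapping WHERE mapping_name = ?",
        (name,),
    )
    return {(domain_id, range_id): sim for domain_id, range_id, sim in cursor}


gs_acm = load_mapping(conn, "gs_acm")
```

Keeping every mapping in the same three-column shape (domain object, range object, similarity) is what makes it possible to pass repository mappings and freshly computed mappings to the same operators.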
An extensible library of matcher algorithms is available for use in a specific match task. Matchers conform to the same interfaces as a match process; in particular, they generate a same-mapping. Otherwise there is no restriction on the implementation of matchers, e.g., they can be attribute matchers or context matchers such as graph-based matching algorithms. Furthermore, they may utilize a variety of similarity computations, e.g., based on string matching, on auxiliary information such as dictionaries, or on query functionality to access data sources. In our current implementation, we use a generic attribute matcher that is provided with a pair of attributes to be matched, a similarity function to be evaluated (e.g., n-gram, TF/IDF or affix) and a similarity threshold to be exceeded by result correspondences. A multi-attribute matcher is also supported, which directly evaluates and combines the similarity for multiple attribute pairs, e.g., for publication title and publication year.
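The generic attribute matcher described above can be sketched as follows. This is a minimal illustration under several assumptions: sources are modelled as dictionaries from instance id to attribute values, the n-gram similarity is a Dice coefficient over padded character trigrams, and all function names and sample records are hypothetical. It compares every pair of instances, which is only practical for small inputs.

```python
from itertools import product


def ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    if not text:
        return set()
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def trigram_similarity(a, b):
    """Dice coefficient over character trigrams, a value in [0, 1]."""
    grams_a, grams_b = ngrams(a), ngrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def attribute_matcher(source_a, source_b, attr_a, attr_b, sim, threshold):
    """Compare all instance pairs on one attribute pair.

    Returns a same-mapping {(id_a, id_b): similarity} containing only the
    pairs whose similarity exceeds the threshold.
    """
    mapping = {}
    for (id_a, obj_a), (id_b, obj_b) in product(source_a.items(), source_b.items()):
        score = sim(obj_a[attr_a], obj_b[attr_b])
        if score > threshold:
            mapping[(id_a, id_b)] = score
    return mapping


# Made-up sample instances from two sources with differently named attributes.
dblp = {
    "d1": {"title": "MOMA - A Mapping-based Object Matching System"},
    "d2": {"title": "Schema Matching Survey"},
}
acm = {
    "a1": {"name": "MOMA: a mapping-based object matching system"},
}

title_mapping = attribute_matcher(dblp, acm, "title", "name", trigram_similarity, 0.7)
```

Because the similarity function is passed in as a parameter, another measure such as TF/IDF could be substituted without changing the matcher itself.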
The MOMA match process is a workflow consisting of a sequence of steps. Each such step generates a same-mapping that can be refined by additional steps. Selected workflows can be added to the matcher library for use in other match tasks. The final same-mapping determined by a match process is stored in the mapping repository and can be re-used in other workflows.
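One way to picture such a workflow is as a list of named steps that each produce a same-mapping, with intermediate results kept in a cache and the final result written to the repository. The sketch below assumes that a step is a plain function receiving the repository and the cache; the names `run_workflow`, `title_step` and `refine_step`, as well as the hard-coded similarity values, are hypothetical.

```python
def run_workflow(steps, repository, result_name):
    """Run the steps in order and store the final same-mapping in the repository.

    steps: list of (step_name, step_function) pairs. Each step function takes
    (repository, cache) and returns a same-mapping {(id_a, id_b): similarity}.
    """
    cache = {}   # intermediate same-mappings of this workflow run
    result = {}
    for step_name, step in steps:
        result = step(repository, cache)
        cache[step_name] = result
    repository[result_name] = result  # available for re-use in other workflows
    return result


def title_step(repository, cache):
    # Stand-in for a matcher execution; the values are made up.
    return {("d1", "a1"): 0.9, ("d2", "a1"): 0.4}


def refine_step(repository, cache):
    # Refine the previous step's mapping by dropping weak correspondences.
    return {pair: s for pair, s in cache["title"].items() if s > 0.5}


repository = {}
final = run_workflow(
    [("title", title_step), ("refined", refine_step)], repository, "dblp_acm"
)
```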
Each workflow step consists of two parts: matcher execution and mapping combination. The execution of selected matchers is optional, i.e., a step may only combine existing or previously computed mappings from the mapping repository or mapping cache. The combination of mappings in a step is performed by a specific mapping combiner. The input of a mapping combiner is a list of mappings, taken from either the mapping cache or the mapping repository; the output is a same-mapping. A combiner is specified by a mapping operator followed by an optional selection. The mapping operator specifies how the resulting correspondences are determined from the input mappings, e.g., by a merge or compose operation. The selection step filters the correspondences to restrict the mapping to the most similar instances.
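A mapping combiner of this kind might look like the sketch below. The exact semantics of the operators are not spelled out in this text, so the choices here are assumptions: merge unions the correspondences and combines the scores of pairs found in several mappings with a configurable function, compose joins two mappings on their shared intermediate objects and scores a path by its weakest link, and the selection keeps the best correspondence per source object. All function names are hypothetical.

```python
def merge(mappings, combine=max):
    """Union the correspondences of several mappings.

    Scores of pairs that occur in more than one mapping are combined pairwise
    with `combine` (max by default; min is another possible choice).
    """
    merged = {}
    for mapping in mappings:
        for pair, score in mapping.items():
            merged[pair] = combine(merged[pair], score) if pair in merged else score
    return merged


def compose(first, second):
    """Join `first` (a -> b) with `second` (b -> c) on the shared objects b.

    A path a -> b -> c is scored by its weaker correspondence; if several
    paths connect a and c, the best path score is kept.
    """
    by_source = {}
    for (b, c), score in second.items():
        by_source.setdefault(b, []).append((c, score))
    composed = {}
    for (a, b), score_ab in first.items():
        for c, score_bc in by_source.get(b, []):
            path_score = min(score_ab, score_bc)
            composed[(a, c)] = max(composed.get((a, c), 0.0), path_score)
    return composed


def select_best(mapping, threshold=0.0):
    """Selection: per source object, keep the top correspondence above the threshold."""
    best = {}
    for (a, b), score in mapping.items():
        if score > threshold and (a not in best or score > best[a][1]):
            best[a] = (b, score)
    return {(a, b): score for a, (b, score) in best.items()}


# Made-up input mappings.
title_map = {("d1", "a1"): 0.9, ("d1", "a2"): 0.6}
year_map = {("d1", "a1"): 0.8, ("d2", "a2"): 0.7}
combined = select_best(merge([title_map, year_map]))

dblp_gs = {("d1", "g1"): 0.9}
gs_acm = {("g1", "a1"): 1.0}
via_gs = compose(dblp_gs, gs_acm)
```

In this example, compose derives a DBLP-to-ACM same-mapping from two existing mappings without running any matcher, which corresponds to a workflow step that only combines mappings.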
MOMA offers high flexibility in defining a tailored workflow for a given match task. In particular, it allows the selection and combination of several matchers as well as the re-use and refinement of previously determined mappings.