
MOMA

MOMA - A Mapping-based Object Matching System

Object matching, also called object consolidation, is a crucial task for data integration and data cleaning. It addresses the problem of identifying object instances in data sources that refer to the same real-world entity. We propose a flexible framework called MOMA for mapping-based object matching. It allows the construction of match workflows that combine the results of several matcher algorithms on both attribute values and contextual information. The output of a match task is an instance-level mapping that supports information fusion in P2P data integration systems and can be re-used for other match tasks. MOMA additionally utilizes semantic mappings of different cardinalities and provides merge and compose operators for mapping combination. We propose and evaluate several strategies both for object matching between different sources and for duplicate identification within a single data source.

Architecture

The figure above illustrates the components of the MOMA architecture as well as the process of object matching. A mapping repository is used to materialize both association mappings and same-mappings. Given the simple structure of our mappings, they can be maintained efficiently in relational mapping tables. Many mappings already exist in data sources and can thus be utilized for object matching. For instance, DBLP data on publication lists for venues and for authors are kept as association mappings. Similarly, some same-mappings already exist, e.g., Google Scholar links its publications to ACM. MOMA also maintains a mapping cache for storing intermediate same-mappings derived during a match workflow. Hence, MOMA not only processes the input instances but also utilizes the mappings of the repository and the cache for object matching.
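As a rough illustration of how mappings could be kept in a relational mapping table, the following sketch uses Python's built-in sqlite3 module. The table layout, the column names (mapping_name, kind, domain_id, range_id, sim), the mapping names and all object ids are assumptions made for this example; they are not taken from the MOMA implementation.

```python
import sqlite3

# In-memory database standing in for the mapping repository (assumption).
conn = sqlite3.connect(":memory:")

# One row per correspondence. The schema is hypothetical: 'kind' separates
# association mappings from same-mappings, 'sim' holds a similarity value.
conn.execute("""
    CREATE TABLE mapping (
        mapping_name TEXT NOT NULL,
        kind         TEXT NOT NULL CHECK (kind IN ('same', 'association')),
        domain_id    TEXT NOT NULL,
        range_id     TEXT NOT NULL,
        sim          REAL NOT NULL DEFAULT 1.0,
        PRIMARY KEY (mapping_name, domain_id, range_id)
    )
""")

# Made-up example rows: a venue-to-publication association mapping and a
# same-mapping linking a publication in one source to the same one in another.
rows = [
    ("dblp.VenuePub", "association", "venue:v1", "pub:d1", 1.0),
    ("dblp.VenuePub", "association", "venue:v1", "pub:d2", 1.0),
    ("gs-acm.PubSame", "same", "pub:g7", "pub:a3", 1.0),
]
conn.executemany("INSERT INTO mapping VALUES (?, ?, ?, ?, ?)", rows)


def lookup(conn, mapping_name, domain_id):
    """Return (range_id, sim) pairs that a mapping assigns to one object."""
    cur = conn.execute(
        "SELECT range_id, sim FROM mapping "
        "WHERE mapping_name = ? AND domain_id = ? ORDER BY range_id",
        (mapping_name, domain_id),
    )
    return cur.fetchall()


publications_of_venue = lookup(conn, "dblp.VenuePub", "venue:v1")
```

Keeping every mapping in the same narrow table shape is one way to let a match workflow read repository mappings and cached intermediate mappings through the same lookup.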

There is an extensible library of matcher algorithms that can be used for a specific match task. Matchers conform to the same interfaces as a match process; in particular, they generate a same-mapping. Otherwise there is no restriction on the implementation of matchers, e.g., they can be attribute matchers or context matchers such as graph-based matching algorithms. Furthermore, they may utilize a variety of similarity computations, e.g., based on string matching, on auxiliary information such as dictionaries, or on query functionality to access data sources. In our current implementation, we use a generic attribute matcher that is provided with a pair of attributes to be matched, a similarity function to be evaluated (e.g., n-gram, TF/IDF or affix) and a similarity threshold to be exceeded by result correspondences. A multi-attribute matcher is also supported, which directly evaluates and combines the similarity for multiple attribute pairs, e.g., for publication title and publication year.
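The generic attribute matcher described above can be sketched as follows. This is a minimal illustration, not the MOMA code: the function names, the dictionary-based data layout and the sample records are hypothetical, and the n-gram similarity is assumed here to be a Dice coefficient over padded character trigrams, which is one common choice among several.

```python
def ngrams(text, n=3):
    """Set of character n-grams of the lower-cased text, padded with '#'."""
    pad = "#" * (n - 1)
    padded = pad + text.lower() + pad
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_sim(a, b):
    """Dice coefficient over trigram sets (assumed similarity formula)."""
    grams_a, grams_b = ngrams(a), ngrams(b)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def attribute_matcher(source_a, source_b, attr_a, attr_b, sim_fn, threshold):
    """Compare one attribute pair for all object pairs of two sources.

    Returns a same-mapping as a list of (id_a, id_b, similarity) tuples,
    keeping only correspondences whose similarity reaches the threshold.
    """
    mapping = []
    for id_a, obj_a in source_a.items():
        for id_b, obj_b in source_b.items():
            sim = sim_fn(obj_a[attr_a], obj_b[attr_b])
            if sim >= threshold:
                mapping.append((id_a, id_b, sim))
    return mapping


# Hypothetical sample instances from two sources with different attribute names.
source_a = {
    "d1": {"title": "MOMA - A Mapping-based Object Matching System"},
    "d2": {"title": "Dynamic Fusion of Web Data"},
}
source_b = {
    "a1": {"name": "MOMA: a mapping-based object matching system"},
    "a2": {"name": "Citation analysis of database publications"},
}

same_mapping = attribute_matcher(
    source_a, source_b, "title", "name", ngram_sim, threshold=0.7
)
```

The nested loop compares every pair of objects, which is only practical for small inputs; it is kept here to show the interface (attribute pair, similarity function, threshold) rather than an efficient evaluation strategy.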

The MOMA match process is a workflow consisting of a sequence of steps. Each such step generates a same-mapping that can be refined by additional steps. Selected workflows can be added to the matcher library for use in other match tasks. The final same-mapping determined by a match process is stored in the mapping repository and can be re-used in other workflows.

Each workflow step consists of two parts: matcher execution and mapping combination. The execution of selected matchers is optional, i.e., a step may only combine existing or previously computed mappings from the mapping repository or mapping cache. The combination of mappings in a step is performed by a specific mapping combiner. The input of a mapping combiner is a list of mappings, taken from the mapping cache or the mapping repository; the output is a same-mapping. A combiner is specified by a mapping operator followed by an optional selection. The mapping operator specifies how the resulting correspondences are determined from the input mappings, e.g., by a merge or compose operation. The selection step filters the correspondences to restrict the mapping to the most similar instances.
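A mapping combiner, i.e., a mapping operator followed by an optional selection, might be sketched as below. The concrete semantics are assumptions for illustration only: merge is taken to be a union of correspondences that keeps the maximum similarity per pair, compose multiplies similarities along a shared intermediate object and keeps the best path, and the selection is a simple threshold filter. MOMA may define these operators differently; all function names and sample data are hypothetical.

```python
from collections import defaultdict


def merge(*mappings, combine=max):
    """Union of correspondences; similarities of the same pair are combined."""
    sims = defaultdict(list)
    for mapping in mappings:
        for a, b, sim in mapping:
            sims[(a, b)].append(sim)
    return [(a, b, combine(values)) for (a, b), values in sims.items()]


def compose(map_ab, map_bc):
    """Derive A-C correspondences via shared B objects.

    Similarities are multiplied along each path (assumption); if several
    paths connect the same pair, the highest positive value is kept.
    """
    by_b = defaultdict(list)
    for b, c, sim in map_bc:
        by_b[b].append((c, sim))
    best = {}
    for a, b, sim_ab in map_ab:
        for c, sim_bc in by_b.get(b, []):
            sim = sim_ab * sim_bc
            if sim > best.get((a, c), 0.0):
                best[(a, c)] = sim
    return [(a, c, sim) for (a, c), sim in best.items()]


def select_threshold(mapping, threshold):
    """Selection: keep only correspondences reaching the threshold."""
    return [row for row in mapping if row[2] >= threshold]


def combiner(operator, selection=None):
    """Build a combiner: a mapping operator followed by an optional selection."""
    def run(*inputs):
        result = operator(*inputs)
        return selection(result) if selection else result
    return run


# Hypothetical intermediate same-mappings, e.g. from a title and a year matcher.
title_map = [("d1", "a1", 0.9), ("d2", "a2", 0.6)]
year_map = [("d1", "a1", 0.8), ("d3", "a3", 0.75)]

step = combiner(merge, lambda m: select_threshold(m, 0.7))
step_result = sorted(step(title_map, year_map))

# Composing two existing same-mappings through an intermediate source.
via_intermediate = sorted(
    compose([("d1", "g1", 1.0)], [("g1", "a1", 0.9), ("g1", "a2", 0.5)])
)
```

Because every operator takes mappings in and returns a mapping, the output of one step can be passed directly as input to the next step of a workflow.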

MOMA offers high flexibility in defining a tailored workflow for a given match task. In particular, it allows the selection and combination of several matchers as well as the re-use and refinement of previously determined mappings.

Publications

Thor, A.; Rahm, E.
CloudFuice: A flexible Cloud-based Data Integration System
Proc. of 11th Intl. Conference on Web Engineering (ICWE), 2011
2011-06

Rahm, E.; Thor, A.; Aumueller, D.
Dynamic Fusion of Web Data
Proc. 5th Intl. XML Database Symposium (XSym), 2007
2007-09

Thor, A.; Aumueller, D.; Rahm, E.
Data Integration Support for Mashups
Proc. 6th Intl. Workshop on Information Integration on the Web (IIWeb), 2007
2007-07

Kirsten, T.; Thor, A.; Rahm, E.
Instance-based matching of large life science ontologies
Proc. of 4th Intl. Workshop on Data Integration in the Life Sciences (DILS), 2007
2007-06

Thor, A.; Kirsten, T.; Rahm, E.
Instance-based matching of hierarchical ontologies
Proc. of 12. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web (BTW), 2007
2007-03

Köpcke, H.; Rahm, E.
Analyse von Zitierungshäufigkeiten für die Datenbankkonferenz BTW
Datenbank-Spektrum, 7. Jahrgang, Heft 20
2007-02

Thor, A.; Rahm, E.
MOMA - A Mapping-based Object Matching System
Proc. 3rd Conference on Innovative Data Systems Research (CIDR), 2007
2007-01

Kirsten, T.; Rahm, E.
BioFuice: Mapping-based data integration in bioinformatics
Proc. of 3rd Int. Workshop on Data Integration in the Life Sciences (DILS), Springer LNCS 4075, 2006
2006-07

Rahm, E.; Thor, A.
Citation analysis of database publications
ACM SIGMOD Record 34(4), 2005
2005-12

Rahm, E.; Thor, A.; Aumueller, D.; Do, H.H.; Golovin, N.; Kirsten, T.
iFuice - Information Fusion utilizing Instance Correspondences and Peer Mappings
Proc. 8th Intl. Workshop on the Web and Databases (WebDB), 2005
2005-06

Kirsten, T.; Rahm, E.
BioFuice: A decentralized Approach to integrate molecular-biological Data
Proc. 4th Research Festival for Life Sciences, Leipzig, Dec. 2005
2005