Web-based entity resolution, particularly in the context of online marketplaces and e-commerce ecosystems, is a critical task: accurately identifying and matching product offers across the web that refer to the same real-world product. Traditional approaches to entity resolution have relied primarily on textual information, but the increasing availability of diverse data modalities has motivated multimodal approaches. This paper introduces an intermediate fusion architecture for multimodal product matching that combines textual information from RoBERTa embeddings with visual information from Swin Transformer embeddings. Our approach improves matching accuracy by exploiting the complementary nature of the text and image modalities. Experimental results on the WDC Shoes and Zalando datasets show that the proposed approach outperforms both unimodal models and multimodal baselines. These results highlight the potential of multimodal product matching to improve entity resolution in online marketplaces and, in turn, the user shopping experience.
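To make the idea of intermediate fusion concrete, the sketch below shows one way such an architecture can be wired together: each offer's text and image are encoded separately with RoBERTa and a Swin Transformer, the two unimodal embeddings are fused into a joint offer representation, and a pair of fused representations is then scored for a match. This is a minimal illustration, not the paper's exact architecture; the checkpoint names, projection dimensions, and pair-scoring head are assumptions.

```python
# Minimal sketch of intermediate fusion for pairwise product matching.
# Checkpoints, dimensions, and the classifier head are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import RobertaModel, SwinModel


class IntermediateFusionMatcher(nn.Module):
    def __init__(self, fused_dim: int = 512):
        super().__init__()
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")  # 768-d
        self.image_encoder = SwinModel.from_pretrained(
            "microsoft/swin-tiny-patch4-window7-224")                      # 768-d
        # Intermediate fusion: project the concatenated unimodal embeddings
        # of each offer into a joint representation before comparing offers.
        self.fusion = nn.Sequential(
            nn.Linear(768 + 768, fused_dim),
            nn.ReLU(),
        )
        # Pair classifier over [emb_a, emb_b, |emb_a - emb_b|] -> match logit.
        self.classifier = nn.Linear(3 * fused_dim, 1)

    def encode_offer(self, input_ids, attention_mask, pixel_values):
        text = self.text_encoder(input_ids=input_ids,
                                 attention_mask=attention_mask)
        image = self.image_encoder(pixel_values=pixel_values)
        text_emb = text.last_hidden_state[:, 0]   # <s> token embedding
        image_emb = image.pooler_output            # pooled patch embedding
        return self.fusion(torch.cat([text_emb, image_emb], dim=-1))

    def forward(self, offer_a, offer_b):
        emb_a = self.encode_offer(**offer_a)
        emb_b = self.encode_offer(**offer_b)
        pair = torch.cat([emb_a, emb_b, (emb_a - emb_b).abs()], dim=-1)
        return self.classifier(pair).squeeze(-1)   # logit: same product or not
```

In this reading, fusion happens after the unimodal encoders but before the matching decision, which is what distinguishes intermediate fusion from early fusion (mixing raw inputs) and late fusion (combining per-modality match scores).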