Entity-Graph Enhanced Cross-Modal Pretraining for Instance-Level Product Retrieval

Xiao Dong; Xunlin Zhan; Yunchao Wei; Xiaoyong Wei; Yaowei Wang; Minlong Lu; Xiaochun Cao; Xiaodan Liang

doi:10.1109/TPAMI.2023.3291237

IEEE Trans Pattern Anal Mach Intell. 2023 Nov;45(11):13117-13133. doi: 10.1109/TPAMI.2023.3291237. Epub 2023 Oct 3.

PMID: 37390000

Abstract

We study weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories in a more realistic setting. We first contribute the Product1M datasets and define two practical instance-level retrieval tasks that enable evaluation of price comparison and personalized recommendation. For both tasks, accurately identifying the intended product target mentioned in visual-linguistic data and mitigating the impact of irrelevant content are challenging. To address this, we devise a more effective cross-modal pretraining model capable of adaptively incorporating key concept information from multi-modal data. This is accomplished by utilizing an entity graph, in which nodes represent entities and edges denote the similarity relations between them. Specifically, we propose a novel Entity-Graph Enhanced Cross-Modal Pretraining (EGE-CMP) model for instance-level commodity retrieval, which explicitly injects entity knowledge in both node-based and subgraph-based ways into the multi-modal networks via a self-supervised hybrid-stream transformer. This can reduce the confusion between different object contents, thereby guiding the network to focus on entities with real semantics. Experimental results verify the efficacy and generalizability of EGE-CMP, which outperforms several SOTA cross-modal baselines such as CLIP (Radford et al., 2021), UNITER (Chen et al., 2020), and CAPTURE (Zhan et al., 2021).
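
To make the entity-graph idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each entity already has an embedding vector and that an edge is drawn whenever the cosine similarity of two entities reaches a threshold; the abstract states only that edges denote similarity relations. The function names, the toy embeddings, and the 0.8 threshold are all hypothetical, and the one-hop neighborhood is just one possible reading of a "subgraph-based" view.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_entity_graph(embeddings, threshold=0.8):
    """Build an undirected graph as an adjacency dict.

    Nodes are entity names; an edge (with its similarity as the weight)
    links two entities whose cosine similarity is at least `threshold`.
    The thresholding rule is an assumption made for illustration.
    """
    names = sorted(embeddings)
    graph = {name: {} for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                graph[a][b] = sim
                graph[b][a] = sim
    return graph


def one_hop_subgraph(graph, entity):
    """Node set of an entity's one-hop neighborhood, including itself."""
    return {entity} | set(graph[entity])


# Hypothetical toy embeddings, invented for this example.
toy_embeddings = {
    "lipstick": [0.9, 0.1, 0.0],
    "lip gloss": [0.8, 0.2, 0.1],
    "shampoo": [0.0, 0.1, 0.9],
}

graph = build_entity_graph(toy_embeddings, threshold=0.8)
neighborhood = one_hop_subgraph(graph, "lipstick")
```

In this toy setup the two lip products are linked while the unrelated entity stays isolated, which illustrates how a node-level view (a single entity) and a subgraph-level view (an entity with its similar neighbors) could be derived from the same graph.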