Keyword-Based Diverse Image Retrieval With Variational Multiple Instance Graph

IEEE Trans Neural Netw Learn Syst. 2023 Dec;34(12):10528-10537. doi: 10.1109/TNNLS.2022.3168431. Epub 2023 Nov 30.

Abstract

The task of cross-modal image retrieval has recently attracted considerable research attention. In real-world scenarios, keyword-based queries issued by users are usually short and have broad semantics. Therefore, semantic diversity is as important as retrieval accuracy in such user-oriented services, as it improves the user experience. However, most typical cross-modal image retrieval methods based on single-point query embedding inevitably result in low semantic diversity, while existing diverse retrieval approaches frequently suffer from low accuracy due to a lack of cross-modal understanding. To address this challenge, we introduce an end-to-end solution termed variational multiple instance graph (VMIG), in which a continuous semantic space is learned to capture diverse query semantics, and the retrieval task is formulated as a multiple instance learning problem to connect diverse features across modalities. Specifically, a query-guided variational autoencoder is employed to model the continuous semantic space instead of learning a single-point embedding. Afterward, multiple instances of the image and query are obtained by sampling in the continuous semantic space and applying multihead attention, respectively. Thereafter, an instance graph is constructed to remove noisy instances and align cross-modal semantics. Finally, heterogeneous modalities are robustly fused under multiple losses. Extensive experiments on two real-world datasets verify the effectiveness of our proposed solution in both retrieval accuracy and semantic diversity.
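The pipeline described above (sampling instances from a continuous semantic space, then pruning noisy instances via an instance graph) can be sketched in a minimal, hedged form. The snippet below is an illustration under simplifying assumptions, not the paper's implementation: the query-conditioned distribution is taken to be a diagonal Gaussian sampled via the reparameterization trick, the query instances that VMIG obtains through multihead attention are replaced by random stand-ins, and the instance graph is reduced to cosine-similarity thresholding with a hypothetical cutoff `tau`.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_instances(mu, log_var, k, rng):
    """Reparameterization trick: draw k image-instance embeddings
    from the query-conditioned Gaussian N(mu, diag(exp(log_var)))."""
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal((k, mu.shape[0]))
    return mu + eps * std  # shape (k, d)

def instance_graph(img_inst, qry_inst, tau=0.5):
    """Toy bipartite instance graph: edge weight = cosine similarity
    between image and query instances; edges below tau are pruned,
    standing in for the paper's noisy-instance removal."""
    a = img_inst / np.linalg.norm(img_inst, axis=1, keepdims=True)
    b = qry_inst / np.linalg.norm(qry_inst, axis=1, keepdims=True)
    sim = a @ b.T  # (k_img, k_qry) similarity matrix
    return np.where(sim >= tau, sim, 0.0)

d, k = 8, 4  # hypothetical embedding size and instance count
mu, log_var = rng.standard_normal(d), rng.standard_normal(d)
img_instances = sample_instances(mu, log_var, k, rng)
qry_instances = rng.standard_normal((k, d))  # stand-in for multihead-attention outputs
graph = instance_graph(img_instances, qry_instances)
print(graph.shape)  # (4, 4)
```

Sampling several points instead of one embedding is what lets a short, broad query match semantically diverse images; the pruned graph then keeps only cross-modal instance pairs that agree.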