Cross-Modal Retrieval With Partially Mismatched Pairs

IEEE Trans Pattern Anal Mach Intell. 2023 Aug;45(8):9595-9610. doi: 10.1109/TPAMI.2023.3247939. Epub 2023 Jun 30.

Abstract

In this paper, we study a challenging but less-touched problem in cross-modal retrieval, i.e., partially mismatched pairs (PMPs). Specifically, in real-world scenarios, a huge number of multimedia data (e.g., the Conceptual Captions dataset) are collected from the Internet, and thus it is inevitable to wrongly treat some irrelevant cross-modal pairs as matched. Undoubtedly, such a PMP problem will remarkably degrade the cross-modal retrieval performance. To tackle this problem, we derive a unified theoretical Robust Cross-modal Learning framework (RCL) with an unbiased estimator of the cross-modal retrieval risk, which aims to endow the cross-modal retrieval methods with robustness against PMPs. In detail, our RCL adopts a novel complementary contrastive learning paradigm to address the following two challenges, i.e., the overfitting and underfitting issues. On the one hand, our method only utilizes the negative information which is much less likely false compared with the positive information, thus avoiding the overfitting issue to PMPs. However, these robust strategies could induce underfitting issues, thus making training models more difficult. On the other hand, to address the underfitting issue brought by weak supervision, we present to leverage of all available negative pairs to enhance the supervision contained in the negative information. Moreover, to further improve the performance, we propose to minimize the upper bounds of the risk to pay more attention to hard samples. To verify the effectiveness and robustness of the proposed method, we carry out comprehensive experiments on five widely-used benchmark datasets compared with nine state-of-the-art approaches w.r.t. the image-text and video-text retrieval tasks. The code is available at https://github.com/penghu-cs/RCL.

MeSH terms

  • Algorithms*
  • Benchmarking*
  • Internet
  • Learning