Joint Feature Synthesis and Embedding: Adversarial Cross-Modal Retrieval Revisited

IEEE Trans Pattern Anal Mach Intell. 2022 Jun;44(6):3030-3047. doi: 10.1109/TPAMI.2020.3045530. Epub 2022 May 5.

Abstract

Recently, the generative adversarial network (GAN) has shown strong ability in modeling data distributions via adversarial learning. Cross-modal GAN, which attempts to utilize the power of GAN to model the cross-modal joint distribution and to learn compatible cross-modal features, has become a research hotspot. However, the existing cross-modal GAN approaches typically 1) require labeled multimodal data, collected at massive labor cost, to establish cross-modal correlation; 2) utilize the vanilla GAN model, which results in an unstable training procedure and meaningless synthetic features; and 3) lack extensibility for retrieving cross-modal data of new classes. In this article, we revisit the adversarial learning in existing cross-modal GAN methods and propose Joint Feature Synthesis and Embedding (JFSE), a novel method that jointly performs multimodal feature synthesis and common embedding space learning to overcome the above three shortcomings. Specifically, JFSE deploys two coupled conditional Wasserstein GAN modules for the input data of the two modalities, to synthesize meaningful and correlated multimodal features under the guidance of the word embeddings of class labels. Moreover, three distribution alignment schemes with advanced cycle-consistency constraints are proposed to preserve semantic compatibility and enable knowledge transfer in the common embedding space for both the true and the synthetic cross-modal features. These components in JFSE not only help to learn a more effective common embedding space that captures the cross-modal correlation, but also facilitate knowledge transfer to multimodal data of new classes. Extensive experiments are conducted on four widely used cross-modal datasets, and comparisons with more than ten state-of-the-art approaches show that JFSE achieves remarkable accuracy improvements on both the standard retrieval task and the newly explored zero-shot and generalized zero-shot retrieval tasks.
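The abstract's core recipe, per-modality conditional Wasserstein GAN modules guided by class word embeddings plus alignment in a common embedding space, can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration, not the paper's exact formulation: the module names, layer sizes, feature dimensions, the WGAN-GP penalty weight, and the simple pairwise alignment term standing in for the paper's three alignment schemes and cycle-consistency constraints.

import torch
import torch.nn as nn

# Hypothetical dimensions: 4096-d image features, 300-d text features,
# 300-d class word embeddings, 256-d common space, 100-d noise.
IMG_DIM, TXT_DIM, WORD_DIM, EMB_DIM, NOISE_DIM = 4096, 300, 300, 256, 100

class CondGenerator(nn.Module):
    # Synthesizes features for one modality from noise conditioned on
    # the word embedding of a class label.
    def __init__(self, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + WORD_DIM, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, out_dim))
    def forward(self, z, w):
        return self.net(torch.cat([z, w], dim=1))

class CondCritic(nn.Module):
    # Wasserstein critic scoring (feature, class word embedding) pairs.
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + WORD_DIM, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1))
    def forward(self, x, w):
        return self.net(torch.cat([x, w], dim=1))

def gradient_penalty(critic, real, fake, w):
    # WGAN-GP: penalize critic gradient norms on real/fake interpolates,
    # which stabilizes training relative to the vanilla GAN objective.
    alpha = torch.rand(real.size(0), 1, device=real.device)
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads, = torch.autograd.grad(critic(interp, w).sum(), interp,
                                 create_graph=True)
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Two coupled conditional WGAN modules, one per modality, plus linear
# projections into the common embedding space.
gen_img, gen_txt = CondGenerator(IMG_DIM), CondGenerator(TXT_DIM)
crit_img, crit_txt = CondCritic(IMG_DIM), CondCritic(TXT_DIM)
emb_img, emb_txt = nn.Linear(IMG_DIM, EMB_DIM), nn.Linear(TXT_DIM, EMB_DIM)

def step_losses(real_img, real_txt, w):
    z = torch.randn(real_img.size(0), NOISE_DIM)
    fake_img, fake_txt = gen_img(z, w), gen_txt(z, w)

    # Critic losses (fakes detached so only the critics are updated here).
    d_img = (crit_img(fake_img.detach(), w).mean()
             - crit_img(real_img, w).mean()
             + 10.0 * gradient_penalty(crit_img, real_img,
                                       fake_img.detach(), w))
    d_txt = (crit_txt(fake_txt.detach(), w).mean()
             - crit_txt(real_txt, w).mean()
             + 10.0 * gradient_penalty(crit_txt, real_txt,
                                       fake_txt.detach(), w))

    # Alignment: true and synthetic features of a matched pair should
    # coincide in the common space (a simple stand-in for the paper's
    # distribution alignment and cycle-consistency constraints).
    align = ((emb_img(real_img) - emb_txt(real_txt)).pow(2).mean()
             + (emb_img(fake_img) - emb_txt(fake_txt)).pow(2).mean())

    # Generator loss: fool both critics while respecting alignment.
    g = -(crit_img(fake_img, w).mean() + crit_txt(fake_txt, w).mean()) + align
    return d_img, d_txt, g

Two design points this sketch mirrors from the abstract: the Wasserstein objective with a gradient penalty replaces the vanilla GAN loss to address the unstable-training shortcoming, and conditioning both generators on class word embeddings is what lets synthetic features be produced for unseen classes, enabling the zero-shot retrieval setting.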

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Learning
  • Machine Learning*
  • Semantics