Unsupervised Visual-Textual Correlation Learning With Fine-Grained Semantic Alignment

IEEE Trans Cybern. 2022 May;52(5):3669-3683. doi: 10.1109/TCYB.2020.3015084. Epub 2022 May 19.

Abstract

With the rapid growth of multimedia data on the Internet, demand for visual-textual cross-media retrieval between images and sentences has risen sharply. However, the heterogeneity of visual and textual data poses major challenges to measuring cross-media similarity for retrieval. Although existing methods have made great progress by exploiting the strong learning ability of deep neural networks, they rely heavily on large-scale training data with manual annotation, that is, either pairwise image-sentence annotation or category annotation as supervision for visual-textual correlation learning, which is extremely labor and time consuming to collect. Without any pairwise or category annotation, it is highly challenging to construct correlations between images and sentences due to their inconsistent distributions and representations. Yet people naturally understand the correlation between visual and textual data at a high semantic level, and images and sentences containing the same group of semantic concepts are easily matched in the human brain. Inspired by this human cognitive process, this article proposes an unsupervised visual-textual correlation learning (UVCL) approach that constructs correlations without any manual annotation. The contributions are summarized as follows: 1) unsupervised semantic-guided cross-media correlation mining is proposed to bridge the heterogeneity gap between visual and textual data; we measure the semantic matching degree between images and sentences and generate descriptive sentences from the concepts extracted from images to further augment the training data in an unsupervised manner, so that the approach can exploit the semantic knowledge within both visual and textual data to reduce the gap between them for further correlation learning; and 2) unsupervised visual-textual fine-grained semantic alignment is proposed to learn cross-media correlation by aligning fine-grained visual local patches with textual keywords through fine-grained soft attention as well as semantic-guided hard attention, which effectively highlights the fine-grained semantic information within both images and sentences to boost visual-textual alignment. Extensive experiments on visual-textual cross-media retrieval in the unsupervised setting, without any manual annotation, on two widely used datasets, Flickr-30K and MS-COCO, verify the effectiveness of the proposed UVCL approach.
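To make the alignment idea in the abstract concrete, below is a minimal PyTorch-style sketch of fine-grained soft attention between image patch features and keyword features, plus a top-k selection in the spirit of semantic-guided hard attention. This is an illustration of the general technique only, not the authors' UVCL implementation; the tensor shapes, the temperature `tau`, the top-k cutoff `k`, and the mean pooling over words are all assumptions introduced here.

```python
import torch
import torch.nn.functional as F

def soft_attention_similarity(patches, words, tau=10.0):
    """Illustrative image-sentence matching score via soft attention.

    patches: (P, d) features of P visual local patches (e.g., CNN regions)
    words:   (W, d) features of W textual keywords (e.g., word embeddings)
    Returns a scalar matching score for the image-sentence pair.
    """
    # Cosine similarity between every patch and every keyword: (P, W).
    p = F.normalize(patches, dim=-1)
    w = F.normalize(words, dim=-1)
    sim = p @ w.t()

    # Soft attention: each keyword attends over all patches, producing
    # one attended visual context vector per keyword.
    attn = F.softmax(tau * sim, dim=0)           # (P, W), columns sum to 1
    context = attn.t() @ patches                 # (W, d) attended patch mix

    # Score each keyword against its attended context, then pool over words.
    word_scores = F.cosine_similarity(context, words, dim=-1)  # (W,)
    return word_scores.mean()

def hard_attention_select(sim, k=3):
    """Illustrative hard attention: for each keyword, keep only its
    top-k most similar patches and zero out the rest of the (P, W)
    similarity matrix, mimicking concept-guided patch selection."""
    topk = sim.topk(k, dim=0).indices            # (k, W)
    mask = torch.zeros_like(sim).scatter(0, topk, 1.0)
    return sim * mask
```

For example, calling `soft_attention_similarity(torch.randn(36, 512), torch.randn(8, 512))` scores a hypothetical image with 36 region features against a sentence with 8 keyword embeddings; in an unsupervised setting such scores would be driven by shared semantic concepts rather than paired annotations.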

MeSH terms

  • Brain
  • Humans
  • Neural Networks, Computer*
  • Semantics*