MAVA: Multi-level Adaptive Visual-textual Alignment by Cross-media Bi-attention Mechanism

IEEE Trans Image Process. 2019 Nov 22. doi: 10.1109/TIP.2019.2952085. Online ahead of print.

Abstract

The rapid development of information technology has led to fast growth in visual and textual content, which brings great challenges for learning the correlation between images and sentences and performing cross-media retrieval. Existing methods mainly explore cross-media correlation either from global-level instances, i.e., whole images and sentences, or from local-level fine-grained patches, i.e., discriminative image regions and keywords, and thus ignore the complementary information carried by the relations between these fine-grained patches. Relation understanding is naturally important for learning cross-media correlation: people attend not only to the alignment between discriminative image regions and keywords, but also to the relations among them in the visual and textual context. Therefore, in this paper, we propose the Multi-level Adaptive Visual-textual Alignment (MAVA) approach with the following contributions. First, we propose a cross-media multi-pathway fine-grained network that extracts not only local fine-grained patches, i.e., discriminative image regions and keywords, but also visual relations between image regions and textual relations from sentence context, which provide complementary information for exploiting fine-grained characteristics within each media type. Second, we propose a visual-textual bi-attention mechanism that distinguishes fine-grained information of different saliency at both the local and relation levels, providing more discriminative cues for correlation learning. Third, we propose cross-media multi-level adaptive alignment, which explores global, local, and relation alignments. An adaptive alignment strategy is further proposed to emphasize well-matched cross-media pairs and adaptively discard misalignments, learning more precise cross-media correlation. Extensive image-sentence matching experiments on two widely used cross-media datasets, Flickr-30K and MS-COCO, against 10 state-of-the-art methods verify the effectiveness of the proposed MAVA approach.
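To make the bi-attention idea concrete, below is a minimal sketch of a cross-media bi-attention module, assuming a standard scaled dot-product cross-attention formulation. The abstract does not give the exact equations, so the dimensions, names, and normalization choices here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiAttention(nn.Module):
    """Bi-directional attention between image regions and words.

    Hypothetical sketch: attends in both directions so that each word
    weights the regions that support it and each region weights the
    words that describe it.
    """

    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, regions, words):
        # regions: (B, R, D) image-region features
        # words:   (B, W, D) word features
        sim = torch.bmm(regions, words.transpose(1, 2)) * self.scale  # (B, R, W)
        # Visual attention: each word attends over image regions.
        attn_v = F.softmax(sim, dim=1)                                  # over regions
        attended_regions = torch.bmm(attn_v.transpose(1, 2), regions)  # (B, W, D)
        # Textual attention: each region attends over words.
        attn_t = F.softmax(sim, dim=2)                                  # over words
        attended_words = torch.bmm(attn_t, words)                       # (B, R, D)
        return attended_regions, attended_words
```

Similarly, the adaptive alignment strategy, which emphasizes matched pairs and discards misalignments, could plausibly be realized by thresholding local similarity scores. The threshold tau, the cosine similarity, and the max-over-regions aggregation below are all assumptions for illustration.

```python
def adaptive_alignment_score(regions, words, tau=0.0):
    """Aggregate region-word similarities, keeping only aligned pairs.

    Hypothetical sketch: pairs whose cosine similarity falls below tau
    are treated as misalignments and zeroed out; for each word only its
    best-matching region contributes to the sentence-level score.
    """
    r = F.normalize(regions, dim=-1)               # (B, R, D)
    w = F.normalize(words, dim=-1)                 # (B, W, D)
    sim = torch.bmm(r, w.transpose(1, 2))          # (B, R, W) cosine scores
    aligned = torch.clamp(sim - tau, min=0.0)      # discard misaligned pairs
    return aligned.max(dim=1).values.mean(dim=1)   # (B,) image-sentence score

# Example usage with random features (shapes are illustrative):
regions = torch.randn(2, 36, 512)   # e.g., 36 detected regions per image
words = torch.randn(2, 12, 512)     # e.g., 12 words per sentence
att_r, att_w = BiAttention(512)(regions, words)
score = adaptive_alignment_score(regions, words, tau=0.1)
```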