Webly Supervised Knowledge-Embedded Model for Visual Reasoning

IEEE Trans Neural Netw Learn Syst. 2023 Jan 23:PP. doi: 10.1109/TNNLS.2023.3236776. Online ahead of print.

Abstract

Visual reasoning between images and natural language remains a long-standing challenge in computer vision. Conventional deeply supervised methods seek answers to questions by relying on datasets that contain only a limited number of images with textual ground-truth descriptions. Faced with such label scarcity, it is natural to want a larger-scale dataset of several million images annotated with text, but constructing one is extremely time-consuming and laborious. Knowledge-based methods usually treat knowledge graphs (KGs) as static flattened tables to look up the answer, and thus fail to exploit the dynamic updating of KGs. To overcome these deficiencies, we propose a Webly supervised knowledge-embedded model for the task of visual reasoning. On the one hand, motivated by the overwhelming success of Webly supervised learning, we make full use of readily available Web images and their weakly annotated texts to learn an effective representation. On the other hand, we design a knowledge-embedded model that includes a dynamically updated interaction mechanism between the semantic representation model and KGs. Experimental results on two benchmark datasets demonstrate that our proposed model significantly outperforms state-of-the-art approaches on the task of visual reasoning.
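The abstract gives no implementation details, so the following is a minimal, hypothetical PyTorch sketch of what an interaction mechanism between a fused visual-textual representation and KG entity embeddings could look like. The class name, the gated-update formulation, and all dimensions are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeblyKnowledgeEmbeddedModel(nn.Module):
    """Hypothetical sketch: attends over KG entity embeddings with a
    fused image+question feature, then gates fresh query evidence into
    the attended KG context before predicting an answer. This is one
    plausible reading of a 'dynamically updated interaction mechanism',
    not the paper's method."""

    def __init__(self, feat_dim=512, num_entities=1000, kg_dim=256):
        super().__init__()
        # feat_dim: assumed size of the joint visual-textual feature
        self.kg_embeddings = nn.Embedding(num_entities, kg_dim)
        self.query_proj = nn.Linear(feat_dim, kg_dim)
        self.update_gate = nn.Linear(feat_dim + kg_dim, kg_dim)
        self.answer_head = nn.Linear(feat_dim + kg_dim, num_entities)

    def forward(self, joint_feat, entity_ids):
        # joint_feat: (B, feat_dim) fused image+question representation
        # entity_ids: (B, K) candidate KG entities retrieved for the query
        ent = self.kg_embeddings(entity_ids)              # (B, K, kg_dim)
        q = self.query_proj(joint_feat).unsqueeze(1)      # (B, 1, kg_dim)

        # attention over candidate entities, weighted sum as KG context
        attn = F.softmax((q * ent).sum(-1), dim=-1)       # (B, K)
        kg_ctx = (attn.unsqueeze(-1) * ent).sum(1)        # (B, kg_dim)

        # "dynamic update": a sigmoid gate blends query evidence into the
        # attended KG context (an illustrative stand-in for KG updating)
        gate = torch.sigmoid(self.update_gate(
            torch.cat([joint_feat, kg_ctx], dim=-1)))
        kg_ctx = gate * kg_ctx + (1 - gate) * q.squeeze(1)

        # score answers from the feature and the updated KG context
        return self.answer_head(torch.cat([joint_feat, kg_ctx], dim=-1))

# Usage with random stand-in data:
model = WeblyKnowledgeEmbeddedModel()
feat = torch.randn(4, 512)               # batch of 4 fused features
ents = torch.randint(0, 1000, (4, 8))    # 8 candidate entities each
logits = model(feat, ents)               # (4, 1000) answer scores
```

In this reading, the weakly annotated Web data would be used to pretrain the encoder that produces joint_feat, while the gated interaction stands in for the paper's dynamic exchange between the semantic representation model and the KG.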