Multi-View Visual Question Answering with Active Viewpoint Selection

Sensors (Basel). 2020 Apr 17;20(8):2281. doi: 10.3390/s20082281.

Abstract

This paper proposes a framework that iteratively observes a scene in order to answer a given question about it. Conventional visual question answering (VQA) methods are designed to answer questions based on single-view images. However, in real-world applications such as human–robot interaction (HRI), where camera angles and occlusions must be considered, answering questions from a single view can be difficult. Since HRI applications make it possible to observe a scene from multiple viewpoints, it is reasonable to formulate the VQA task in a multi-view setting. In addition, because observing a scene from arbitrary viewpoints is usually costly, we designed the framework to observe the scene actively, continuing only until it has gathered the information needed to answer the given question. The proposed framework achieves question-answering performance comparable to a state-of-the-art method while substantially reducing the number of observation viewpoints required. Additionally, we found that the framework appears to learn to choose more informative viewpoints for answering questions, lowering the number of camera movements required. Moreover, we built a multi-view VQA dataset based on real images. The proposed framework shows high accuracy (94.01%) on this unseen real-image dataset.
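
The observe-decide-answer loop described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration of the general idea, not the paper's implementation: the class names, the confidence-threshold stopping rule, and the random fallback policy are all assumptions introduced here for clarity (the paper trains the viewpoint-selection policy with reinforcement learning).

```python
import random


class DummyVQAModel:
    """Hypothetical stand-in for the VQA module. It fuses the views
    gathered so far and returns an answer with a confidence score;
    here the confidence simply grows with the number of views."""

    def answer(self, views, question):
        confidence = min(1.0, 0.3 * len(views))
        return "yes", confidence


class DummyViewpointPolicy:
    """Hypothetical stand-in for the learned viewpoint-selection policy;
    here it just picks an unvisited viewpoint at random."""

    def next_viewpoint(self, visited, candidates):
        remaining = [v for v in candidates if v not in visited]
        return random.choice(remaining) if remaining else None


def active_vqa(question, candidates, capture, model, policy,
               threshold=0.9, max_moves=5):
    """Iteratively observe the scene until the answer confidence
    exceeds `threshold` or the camera-movement budget is spent."""
    visited, views = [], []
    while len(visited) < max_moves:
        vp = policy.next_viewpoint(visited, candidates)
        if vp is None:
            break
        visited.append(vp)
        views.append(capture(vp))      # capture an image at viewpoint vp
        answer, conf = model.answer(views, question)
        if conf >= threshold:          # stop early: enough evidence
            return answer, visited
    return model.answer(views, question)[0], visited


if __name__ == "__main__":
    answer, used = active_vqa(
        question="Is there a red cube behind the sphere?",
        candidates=list(range(8)),     # e.g. 8 camera poses around the scene
        capture=lambda vp: f"image_from_view_{vp}",
        model=DummyVQAModel(),
        policy=DummyViewpointPolicy(),
    )
    print(answer, "using viewpoints", used)
```

The early-stopping check is what reduces the number of required viewpoints: the loop terminates as soon as the model is sufficiently confident, rather than exhausting all candidate camera poses.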

Keywords: deep learning; human–robot interaction; reinforcement learning; three-dimensional (3D) vision; visual question answering.