Visual question generation for explicit questioning purposes based on target objects

Jiayuan Xie; Jiali Chen; Wenhao Fang; Yi Cai; Qing Li

doi:10.1016/j.neunet.2023.08.007

Neural Netw. 2023 Oct:167:638-647. doi: 10.1016/j.neunet.2023.08.007. Epub 2023 Aug 24.

Authors

Jiayuan Xie¹, Jiali Chen², Wenhao Fang², Yi Cai³, Qing Li⁴

Affiliations

¹ School of Software Engineering, South China University of Technology, Guangzhou, China; Key Laboratory of Big Data and Intelligent Robot (South China University of Technology), Ministry of Education, China; Department of Computing, Hong Kong Polytechnic University, Hong Kong, China.
² School of Software Engineering, South China University of Technology, Guangzhou, China; Key Laboratory of Big Data and Intelligent Robot (South China University of Technology), Ministry of Education, China.
³ School of Software Engineering, South China University of Technology, Guangzhou, China; Key Laboratory of Big Data and Intelligent Robot (South China University of Technology), Ministry of Education, China. Electronic address: ycai@scut.edu.cn.
⁴ Department of Computing, Hong Kong Polytechnic University, Hong Kong, China.

PMID: 37717321
DOI: 10.1016/j.neunet.2023.08.007

Abstract

Visual question generation aims to generate questions about an image that focus on particular target objects and serve a particular questioning purpose. Existing studies mainly use an answer to identify the target object that corresponds to the questioning purpose. However, answers do not always map accurately and completely to every target object, for example when the objects corresponding to the answer are ambiguous or when the answer describes a relationship between multiple objects. To address this problem, we propose a content-controlled question generation model, which generates questions based on a given set of target objects specified in an image. Because the target objects contribute differently during generation, we design a recurrent generative architecture that explicitly controls attention to different objects and their corresponding image information at each generation stage. Extensive experiments on the VQA v2.0 dataset and the Visual7w dataset show that the proposed model outperforms state-of-the-art models and can controllably generate questions with specified content.

Keywords: Questioning purposes; Target object; Visual question generation.
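
The idea of re-weighting attention over a given target object set at each generation stage can be illustrated with a toy sketch. This is not the authors' model: the abstract does not specify the architecture, so the dot-product scoring, the tanh state update, the `mix` parameter, and the names `attend` and `generate_steps` are all assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(state, object_feats):
    """Score each target object against the current state (assumed
    dot-product scoring) and return the attention weights together
    with the weighted sum of object features."""
    weights = softmax([dot(state, f) for f in object_feats])
    dim = len(object_feats[0])
    context = [
        sum(w * f[i] for w, f in zip(weights, object_feats))
        for i in range(dim)
    ]
    return weights, context


def generate_steps(object_feats, init_state, num_steps, mix=0.5):
    """Run a toy recurrent loop: at every stage, recompute attention
    over the target objects, then blend the attended context into the
    state (hypothetical update rule, not taken from the paper)."""
    state = list(init_state)
    history = []
    for _ in range(num_steps):
        weights, context = attend(state, object_feats)
        history.append(weights)
        state = [
            math.tanh((1 - mix) * s + mix * c)
            for s, c in zip(state, context)
        ]
    return history


if __name__ == "__main__":
    # Three hypothetical target-object feature vectors.
    feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    for step, weights in enumerate(generate_steps(feats, [1.0, 0.0], 3)):
        print(step, [round(w, 3) for w in weights])
```

Each entry of the returned history is one attention distribution over the target objects, so the per-stage contribution of every object can be inspected directly. A real model would replace the toy update with a learned recurrent decoder and would also attend to image region features.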