Medical visual question answering via corresponding feature fusion combined with semantic attention

Math Biosci Eng. 2022 Jul 20;19(10):10192-10212. doi: 10.3934/mbe.2022478.

Abstract

Medical visual question answering (Med-VQA) aims to leverage a pre-trained artificial intelligence model to answer clinical questions raised by doctors or patients about radiology images. However, owing to the high level of professional expertise required in the medical field and the difficulty of annotating medical data, Med-VQA lacks sufficient large-scale, well-annotated radiology images for training. To address this problem, researchers have mainly focused on improving the model's visual feature extractor. However, little research has addressed textual feature extraction, and most existing work underestimates the interactions between corresponding visual and textual features. In this study, we propose a corresponding feature fusion (CFF) method to strengthen the interactions between specific features from corresponding radiology images and questions. In addition, we design a semantic attention (SA) module for textual feature extraction, which helps the model focus on the meaningful words in each question while reducing the attention spent on insignificant information. Extensive experiments demonstrate that the proposed method achieves competitive results on two benchmark datasets and outperforms existing state-of-the-art methods in answer prediction accuracy. Experimental results also show that our model is capable of semantic understanding during answer prediction, which gives it certain advantages in Med-VQA.
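
To make the two components concrete, the sketch below shows one plausible way to implement a semantic-attention module over LSTM word states and a gated fusion of paired image and question features. It is a minimal illustration assuming a PyTorch setup; the class names, layer sizes, and the gating operator are hypothetical and are not taken from the paper, whose exact formulation is not given in the abstract.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Hypothetical semantic-attention sketch over question word states.

    Assumes the question is encoded with an LSTM (as the keywords suggest);
    per-word attention weights let the model focus on meaningful words.
    """
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, word_states):                               # (batch, seq_len, hidden_dim)
        weights = torch.softmax(self.score(word_states), dim=1)   # per-word attention weights
        return (weights * word_states).sum(dim=1)                 # attended question feature

class CorrespondingFeatureFusion(nn.Module):
    """Illustrative fusion of paired image and question features.

    A simple gated element-wise fusion standing in for the paper's CFF;
    the actual fusion operator is an assumption, not the published method.
    """
    def __init__(self, visual_dim=2048, text_dim=1024, fused_dim=1024):
        super().__init__()
        self.v_proj = nn.Linear(visual_dim, fused_dim)
        self.q_proj = nn.Linear(text_dim, fused_dim)
        self.gate = nn.Linear(2 * fused_dim, fused_dim)

    def forward(self, v_feat, q_feat):
        v, q = self.v_proj(v_feat), self.q_proj(q_feat)
        g = torch.sigmoid(self.gate(torch.cat([v, q], dim=-1)))   # interaction gate
        return g * v + (1 - g) * q                                # fused multimodal feature

# Usage with dummy tensors: a batch of 2 image/question pairs.
sa = SemanticAttention()
cff = CorrespondingFeatureFusion()
q_feat = sa(torch.randn(2, 12, 1024))        # 12-word questions encoded to LSTM states
fused = cff(torch.randn(2, 2048), q_feat)    # ResNet-style pooled image features
print(fused.shape)                           # torch.Size([2, 1024])
```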

Keywords: long short-term memory; multimodal learning; pre-training model; residual network; semantic attention.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Artificial Intelligence*
  • Attention
  • Humans
  • Semantics*