Medical visual question answering via corresponding feature fusion combined with semantic attention

Math Biosci Eng. 2022 Jul 20;19(10):10192-10212. doi: 10.3934/mbe.2022478.

Abstract

Medical visual question answering (Med-VQA) aims to leverage a pre-trained artificial intelligence model to answer clinical questions raised by doctors or patients about radiology images. However, owing to the high level of professional expertise required in the medical field and the difficulty of annotating medical data, Med-VQA lacks sufficient large-scale, well-annotated radiology images for training. To address this problem, researchers have mainly focused on improving the model's visual feature extractor. However, little research has addressed textual feature extraction, and most existing work underestimates the interactions between corresponding visual and textual features. In this study, we propose a corresponding feature fusion (CFF) method to strengthen the interactions between specific features from corresponding radiology images and questions. In addition, we design a semantic attention (SA) module for textual feature extraction, which helps the model focus on the meaningful words in each question while reducing the attention spent on insignificant information. Extensive experiments demonstrate that the proposed method achieves competitive results on two benchmark datasets and outperforms existing state-of-the-art methods in answer prediction accuracy. Experimental results also show that our model is capable of semantic understanding during answer prediction, which gives it certain advantages in Med-VQA.
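
To make the two components concrete, the sketch below shows one plausible way to implement a semantic-attention module over LSTM word states and a gated fusion of paired image and question features. It is a minimal illustration assuming a PyTorch setup; the class names, layer sizes, and the gating operator are hypothetical and are not taken from the paper, whose exact formulation is not given in the abstract.

```python
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    """Hypothetical semantic-attention sketch over question word states.

    Assumes the question is encoded with an LSTM (as the keywords suggest);
    per-word attention weights let the model focus on meaningful words.
    """
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, word_states):                               # (batch, seq_len, hidden_dim)
        weights = torch.softmax(self.score(word_states), dim=1)   # per-word attention weights
        return (weights * word_states).sum(dim=1)                 # attended question feature

class CorrespondingFeatureFusion(nn.Module):
    """Illustrative fusion of paired image and question features.

    A simple gated element-wise fusion standing in for the paper's CFF;
    the actual fusion operator is an assumption, not the published method.
    """
    def __init__(self, visual_dim=2048, text_dim=1024, fused_dim=1024):
        super().__init__()
        self.v_proj = nn.Linear(visual_dim, fused_dim)
        self.q_proj = nn.Linear(text_dim, fused_dim)
        self.gate = nn.Linear(2 * fused_dim, fused_dim)

    def forward(self, v_feat, q_feat):
        v, q = self.v_proj(v_feat), self.q_proj(q_feat)
        g = torch.sigmoid(self.gate(torch.cat([v, q], dim=-1)))   # interaction gate
        return g * v + (1 - g) * q                                # fused multimodal feature

# Usage with dummy tensors: a batch of 2 image/question pairs.
sa = SemanticAttention()
cff = CorrespondingFeatureFusion()
q_feat = sa(torch.randn(2, 12, 1024))        # 12-word questions encoded to LSTM states
fused = cff(torch.randn(2, 2048), q_feat)    # ResNet-style pooled image features
print(fused.shape)                           # torch.Size([2, 1024])
```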

Keywords: long short-term memory; multimodal learning; pre-training model; residual network; semantic attention.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Artificial Intelligence*
  • Attention
  • Humans
  • Semantics*