Asymmetric cross-modal attention network with multimodal augmented mixup for medical visual question answering

Artif Intell Med. 2023 Oct:144:102667. doi: 10.1016/j.artmed.2023.102667. Epub 2023 Sep 18.

Abstract

Insufficient training data is a common barrier that prevents existing medical Visual Question Answering (VQA) models from effectively learning multimodal information interactions and question semantics. This paper proposes a new Asymmetric Cross-Modal Attention (ACMA) network, which constructs an image-guided attention and a question-guided attention to improve multimodal interactions learned from insufficient data. In addition, a newly designed Semantic Understanding Auxiliary (SUA) within the question-guided attention integrates word-level and sentence-level information to learn rich semantic embeddings and improve question understanding. Moreover, we propose a new data augmentation method called Multimodal Augmented Mixup (MAM) to train the ACMA; the resulting model is denoted ACMA-MAM. MAM incorporates various data augmentations and a vanilla mixup strategy to generate more non-repetitive training data, which avoids time-consuming manual annotation and improves model generalization. Our ACMA-MAM outperforms state-of-the-art models on three publicly accessible medical VQA datasets (VQA-Rad, VQA-Slake, and PathVQA) with accuracies of 76.14 %, 83.13 %, and 53.83 %, respectively, improvements of 2.00 %, 1.32 %, and 1.59 %. Moreover, our model achieves F1 scores of 78.33 %, 82.83 %, and 51.86 %, surpassing the state-of-the-art models by 2.80 %, 1.15 %, and 1.37 %, respectively.
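
The two mechanisms named in the abstract, asymmetric cross-modal attention and vanilla mixup over multimodal inputs, can be sketched in a few lines. The snippet below is a minimal illustration only, not the authors' implementation: the module names (AsymmetricCrossModalAttention, multimodal_mixup), the feature dimensions, the use of torch.nn.MultiheadAttention, and the assumption that mixup is applied jointly to image features, question embeddings, and labels are all hypothetical choices made for this sketch.

    # Minimal sketch (not the authors' code): asymmetric cross-modal attention
    # plus vanilla mixup, as described at a high level in the abstract.
    # All names, dimensions, and layer choices here are illustrative assumptions.
    import torch
    import torch.nn as nn

    class AsymmetricCrossModalAttention(nn.Module):
        """Image-guided and question-guided cross-attention (illustrative only)."""
        def __init__(self, dim=768, heads=8):
            super().__init__()
            # Question features attend to image features (image-guided attention).
            self.image_guided = nn.MultiheadAttention(dim, heads, batch_first=True)
            # Image features attend to question features (question-guided attention).
            self.question_guided = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, img_feats, q_feats):
            # img_feats: (B, N_img, dim); q_feats: (B, N_q, dim)
            q_attended, _ = self.image_guided(q_feats, img_feats, img_feats)
            img_attended, _ = self.question_guided(img_feats, q_feats, q_feats)
            return img_attended, q_attended

    def multimodal_mixup(img, q_emb, labels, alpha=0.2):
        """Vanilla mixup applied jointly to image, question embedding, and soft label.

        Whether the paper mixes images, questions, or both is not stated in the
        abstract; mixing both is an assumption made here for illustration.
        """
        lam = torch.distributions.Beta(alpha, alpha).sample()
        idx = torch.randperm(img.size(0))
        img_mix = lam * img + (1 - lam) * img[idx]
        q_mix = lam * q_emb + (1 - lam) * q_emb[idx]
        lbl_mix = lam * labels + (1 - lam) * labels[idx]  # labels as one-hot / soft targets
        return img_mix, q_mix, lbl_mix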

Keywords: Cross modal attention; Data augmentation; Medical Visual Question Answering; Mixup; Multimodal interaction.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Learning*
  • Semantics*