COM: Contrastive Masked-attention model for incomplete multimodal learning

Shuwei Qian; Chongjun Wang

doi:10.1016/j.neunet.2023.03.003

Neural Netw. 2023 May:162:443-455. doi: 10.1016/j.neunet.2023.03.003. Epub 2023 Mar 5.

Authors

Shuwei Qian¹, Chongjun Wang²

Affiliations

¹ State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China; Department of Computer Science and Technology, Nanjing University, Nanjing, 210023, China. Electronic address: qiansw@smail.nju.edu.cn.
² State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China; Department of Computer Science and Technology, Nanjing University, Nanjing, 210023, China. Electronic address: chjwang@nju.edu.cn.

PMID: 36965274
DOI: 10.1016/j.neunet.2023.03.003

Abstract

Most multimodal learning methods assume that all modalities are always available in the data. However, in real-world applications, this assumption is often violated due to privacy protection, sensor failure, and similar causes. Previous works on incomplete multimodal learning often suffer from one of the following drawbacks: introducing noise, lacking flexibility with respect to missing patterns, or failing to capture interactions between modalities. To overcome these challenges, we propose a COntrastive Masked-attention model (COM). The framework performs cross-modal contrastive learning with GAN-based augmentation to reduce the modality gap, and employs a masked-attention model to capture interactions between modalities. The augmentation adapts cross-modal contrastive learning to the incomplete case through a two-player game, improving the effectiveness of the multimodal representations. Interactions between modalities are modeled by stacking self-attention blocks, and attention masks restrict these interactions to the observed modalities to avoid extra noise. All modality combinations share a unified architecture, so the model is flexible to different missing patterns. Extensive experiments on six datasets demonstrate the effectiveness and robustness of the proposed method for incomplete multimodal learning.
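
The idea of restricting attention to observed modalities can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes one embedding vector per modality, uses a single attention block with identity query/key/value projections (a real model would learn these), and assumes at least one modality is observed. The function name `masked_self_attention` and all variable names are hypothetical.

```python
import numpy as np

def masked_self_attention(x, observed):
    """Single self-attention step over modality tokens (illustrative only).

    x:        (M, d) array, one embedding per modality; rows for missing
              modalities may hold any finite placeholder values.
    observed: (M,) boolean array, True where the modality is available.
    Assumption: identity Q/K/V projections instead of learned ones.
    """
    assert observed.any(), "at least one modality must be observed"
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                      # (M, M) similarities
    # Attention mask: no token may attend to a missing modality (column).
    scores = np.where(observed[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    out = weights @ x
    # Outputs at missing positions are zeroed so they carry no signal.
    return out * observed[:, None], weights

# Three modalities, the second one missing.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
observed = np.array([True, False, True])
out, weights = masked_self_attention(x, observed)
```

Because the masked column receives zero attention weight, whatever placeholder sits in the missing modality's slot cannot influence the representations of the observed modalities, and the same function handles every missing pattern by changing only `observed`.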

Keywords: Attention mechanism; Contrastive learning; Missing modality; Multimodal learning.

MeSH terms

Learning*
Privacy*