COM: Contrastive Masked-attention model for incomplete multimodal learning

Neural Netw. 2023 May:162:443-455. doi: 10.1016/j.neunet.2023.03.003. Epub 2023 Mar 5.

Abstract

Most multimodal learning methods assume that all modalities are always available in data. However, in real-world applications, the assumption is often violated due to privacy protection, sensor failure etc. Previous works for incomplete multimodal learning often suffer from one of the following drawbacks: introducing noise, lacking flexibility to missing patterns and failing to capture interactions between modalities. To overcome these challenges, we propose a COntrastive Masked-attention model (COM). The framework performs cross-modal contrastive learning with GAN-based augmentation to reduce modality gap, and employs a masked-attention model to capture interactions between modalities. The augmentation adapts cross-modal contrastive learning to suit incomplete case by a two-player game, improving the effectiveness of multimodal representations. Interactions between modalities are modeled by stacking self-attention blocks, and attention masks limit them on the observed modalities to avoid extra noise. All kinds of modality combinations share a unified architecture, so the model is flexible to different missing patterns. Extensive experiments on six datasets demonstrate the effectiveness and robustness of the proposed method for incomplete multimodal learning.

Keywords: Attention mechanism; Contrastive learning; Missing modality; Multimodal learning.

MeSH terms

  • Learning*
  • Privacy*