Inter- and Intra-Modal Contrastive Hybrid Learning Framework for Multimodal Abstractive Summarization

Jiangfeng Li; Zijian Zhang; Bowen Wang; Qinpei Zhao; Chenxi Zhang

doi:10.3390/e24060764

Entropy (Basel). 2022 May 29;24(6):764. doi: 10.3390/e24060764.

Authors

Jiangfeng Li¹, Zijian Zhang², Bowen Wang¹, Qinpei Zhao¹, Chenxi Zhang¹

Affiliations

¹ School of Software Engineering, Tongji University, Shanghai 201804, China.
² Meituan-Dianping Group, Shanghai 200050, China.

Abstract

Abstractive summarization lets internet users grasp an article by reading its summary instead of the full text. However, current technologies for summarizing articles that combine text and images are limited by the semantic gap between vision and language. They focus on aggregating features and neglect the heterogeneity of each modality. In addition, because they consider neither the intrinsic data properties within each modality nor the semantic information carried by cross-modal correlations, the representations they learn are of poor quality. We therefore propose ITCH, a novel Inter- and Intra-modal Contrastive Hybrid learning framework that learns to align multimodal information automatically and maintains the semantic consistency of input/output flows. Moreover, ITCH can be used as a component, which makes the model suitable for both supervised and unsupervised learning approaches. Experiments on two public datasets, MMS and MSMO, show that ITCH outperforms the current baselines.

Keywords: contrastive learning; cross-modal fusion; multimodal abstractive summarization; supervised and unsupervised learning.
