Fusion of Multi-Modal Features to Enhance Dense Video Caption

Xuefei Huang; Ka-Hou Chan; Weifan Wu; Hao Sheng; Wei Ke

doi:10.3390/s23125565

Sensors (Basel). 2023 Jun 14;23(12):5565. doi: 10.3390/s23125565.

Authors

Xuefei Huang¹, Ka-Hou Chan¹,², Weifan Wu¹, Hao Sheng¹,³,⁴, Wei Ke¹,²

Affiliations

¹ Faculty of Applied Sciences, Macao Polytechnic University, Macau 999078, China.
² Engineering Research Centre of Applied Technology on Machine Translation and Artificial Intelligence of Ministry of Education, Macao Polytechnic University, Macau 999078, China.
³ State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing 100191, China.
⁴ Beihang Hangzhou Innovation Institute Yuhang, Yuhang District, Hangzhou 310023, China.

Abstract

Dense video captioning is a task that aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most existing methods use only the visual features of the video and ignore the audio features, which are also essential for understanding it. In this paper, we propose a fusion model built on the Transformer framework that integrates both the visual and audio features of a video for captioning. We use multi-head attention to handle the differences in sequence length between the models involved in our approach. We also introduce a Common Pool that stores the generated features and aligns them with the time steps, filtering the information and eliminating redundancy based on confidence scores. Moreover, we use an LSTM as the decoder to generate the description sentences, which reduces the memory size of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.

Keywords: dense video caption; feature extraction; multi-modal feature fusion; neural network; video captioning.
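
The abstract states that multi-head attention is used to cope with feature sequences of different lengths. The following is a minimal sketch of that general idea only, not the authors' model: visual features act as queries and audio features as keys and values, so the fused output keeps the visual sequence length whatever the audio length is. The function names, the toy feature values, the choice of visual features as queries, and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    queries: list of T_q vectors; keys/values: list of T_k vectors.
    T_q and T_k may differ; the output always has T_q rows.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def multi_head_cross_attention(queries, keys, values, num_heads):
    """Split the feature dimension into heads, attend per head, concatenate.

    Assumption: no learned projections, to keep the sketch dependency-free.
    """
    dim = len(queries[0])
    assert dim % num_heads == 0, "feature dimension must divide evenly into heads"
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        part = slice(h * head_dim, (h + 1) * head_dim)
        heads.append(cross_attention(
            [q[part] for q in queries],
            [k[part] for k in keys],
            [v[part] for v in values],
        ))
    # Concatenate the per-head outputs for each query time step.
    return [[x for head in heads for x in head[t]] for t in range(len(queries))]


# Hypothetical toy features: 5 visual time steps and 3 audio time steps, dim 4.
visual = [[0.1 * (t + 1), 0.2, -0.1 * t, 0.5] for t in range(5)]
audio = [[0.3, -0.2 * (t + 1), 0.1, 0.4 * t] for t in range(3)]

fused = multi_head_cross_attention(visual, audio, audio, num_heads=2)
print(len(fused), len(fused[0]))  # 5 4
```

Each fused row is a weighted average of the audio vectors, with weights determined by the corresponding visual time step, which is why the two modalities do not need to share a sequence length.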

Grants and funding