Concept-Aware Video Captioning: Describing Videos With Effective Prior Information

IEEE Trans Image Process. 2023;32:5366-5378. doi: 10.1109/TIP.2023.3307969. Epub 2023 Oct 2.

Abstract

Concepts, a collective term for meaningful words that correspond to objects, actions, and attributes, can act as an intermediary for video captioning. Although many efforts have been made to augment video captioning with concepts, most methods suffer from limited precision of concept detection and insufficient utilization of concepts, supplying caption generation with inaccurate and inadequate prior information. To address these issues, we propose a Concept-awARE video captioning framework (CARE) that facilitates plausible caption generation. Built on the encoder-decoder structure, CARE detects concepts precisely via multimodal-driven concept detection (MCD) and offers sufficient prior information to caption generation through global-local semantic guidance (G-LSG). Specifically, we implement MCD by leveraging video-to-text retrieval and the multimedia nature of videos. To achieve G-LSG, given the concept probabilities predicted by MCD, we weight and aggregate concepts to mine the video's latent topic, which guides decoding globally, and devise a simple yet efficient hybrid attention module that exploits concepts and video content to influence decoding locally. Finally, to develop CARE, we emphasize transferring knowledge from a contrastive vision-language pre-trained model (i.e., CLIP) for both visual understanding and video-to-text retrieval. With the multi-role CLIP, CARE outperforms strong CLIP-based video captioning baselines at an affordable extra cost in parameters and inference latency. Extensive experiments on the MSVD, MSR-VTT, and VATEX datasets demonstrate the versatility of our approach across different encoder-decoder networks and the superiority of CARE over state-of-the-art methods. Our code is available at https://github.com/yangbang18/CARE.
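To make the G-LSG idea in the abstract concrete, the PyTorch sketch below shows one plausible way to use concept probabilities for both global and local guidance: a probability-weighted aggregation of concept embeddings yields a latent topic vector that conditions every decoding step, while a gated hybrid attention mixes video content and concept embeddings per step. This is only an illustrative sketch; the class, dimensions, and gating scheme are assumptions and not the authors' implementation, which is available at https://github.com/yangbang18/CARE.

```python
# Illustrative sketch (not the CARE implementation) of global-local semantic
# guidance: concept probabilities from a detector weight concept embeddings
# into a topic vector (global), and a hybrid attention fuses video and concept
# context at each decoding step (local). All names/shapes are hypothetical.
import torch
import torch.nn as nn


class GlobalLocalSemanticGuidance(nn.Module):
    def __init__(self, num_concepts: int, d_model: int, num_heads: int = 8):
        super().__init__()
        # Learnable embedding for every concept in the concept vocabulary.
        self.concept_emb = nn.Embedding(num_concepts, d_model)
        # Projects the probability-weighted concept mixture into a topic vector.
        self.topic_proj = nn.Linear(d_model, d_model)
        # One attention over video features, one over concept embeddings.
        self.video_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.concept_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Gate deciding how much visual vs. semantic context to use per step.
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, dec_states, video_feats, concept_probs):
        # dec_states:    (B, T, D) decoder hidden states
        # video_feats:   (B, N, D) encoded video features
        # concept_probs: (B, C)    concept probabilities from the detector
        B = concept_probs.size(0)
        all_concepts = self.concept_emb.weight.unsqueeze(0).expand(B, -1, -1)  # (B, C, D)

        # Global guidance: weighted aggregation of concepts -> latent topic vector.
        weights = concept_probs / concept_probs.sum(dim=-1, keepdim=True).clamp_min(1e-6)
        topic = self.topic_proj(
            torch.einsum("bc,cd->bd", weights, self.concept_emb.weight)
        ).unsqueeze(1)                      # (B, 1, D)
        dec_states = dec_states + topic     # condition every decoding step globally

        # Local guidance: hybrid attention over video content and concepts.
        vis_ctx, _ = self.video_attn(dec_states, video_feats, video_feats)
        sem_ctx, _ = self.concept_attn(dec_states, all_concepts, all_concepts)
        g = torch.sigmoid(self.gate(torch.cat([vis_ctx, sem_ctx], dim=-1)))
        return g * vis_ctx + (1.0 - g) * sem_ctx  # fused context for word prediction
```

In this reading, the learned gate lets the decoder lean on detected concepts when visual evidence is ambiguous and on video features otherwise; the actual fusion used by CARE may differ.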