Semantic guidance network for video captioning

Sci Rep. 2023 Sep 26;13(1):16076. doi: 10.1038/s41598-023-43010-3.

Abstract

Video captioning is a challenging task that aims to generate rich natural language descriptions of video content, and it has become a promising direction in artificial intelligence. However, most existing methods suffer from visual information redundancy and scene information omission due to the limitations of their sampling strategies. To address these problems, a semantic guidance network for video captioning is proposed. More specifically, a novel scene frame sampling strategy is first proposed to select key scene frames. Then, a vision transformer encoder is applied to learn visual and semantic information with a global view, alleviating the information loss incurred when modeling long-range dependencies in the encoder's hidden layers. Finally, a non-parametric metric learning module is introduced to compute a similarity value between the ground truth and the predicted caption, and the model is optimized in an end-to-end manner. Experiments on the benchmark MSR-VTT and MSVD datasets show that the proposed method effectively improves description accuracy and generalization ability.
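The abstract does not spell out how key scene frames are chosen. As an illustrative sketch only (the paper's actual strategy may differ), scene-aware sampling is often approximated by ranking inter-frame histogram changes and keeping frames that follow the largest scene transitions; the function name and parameters below are hypothetical:

```python
import numpy as np

def sample_scene_frames(frames, num_keyframes, bins=16):
    """Select key scene frames by ranking inter-frame histogram changes.

    NOTE: illustrative sketch only -- the paper's sampling strategy is not
    specified in the abstract; histogram differencing is a common stand-in
    for scene-change detection.

    frames: array of shape (T, H, W), grayscale pixel values in [0, 255].
    Returns sorted indices of the selected frames.
    """
    # Per-frame intensity histograms, normalized to probability distributions.
    hists = np.stack([
        np.histogram(f, bins=bins, range=(0, 255))[0].astype(float)
        for f in frames
    ])
    hists /= hists.sum(axis=1, keepdims=True)

    # Scene-change score: L1 distance between consecutive histograms.
    change = np.abs(np.diff(hists, axis=0)).sum(axis=1)

    # Always keep the first frame; then take frames right after the
    # largest scene changes until the budget is filled.
    picks = {0}
    for i in np.argsort(change)[::-1]:
        if len(picks) >= num_keyframes:
            break
        picks.add(i + 1)
    return sorted(picks)
```

On a clip with three uniform "scenes" (e.g. all-black, mid-gray, all-white segments), this selects the first frame of each scene rather than redundant frames within one scene, which is the redundancy/omission trade-off the abstract targets.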