A Hierarchical Spatial-Temporal Cross-Attention Scheme for Video Summarization Using Contrastive Learning

Sensors (Basel). 2022 Oct 28;22(21):8275. doi: 10.3390/s22218275.

Abstract

Video summarization (VS) is a widely used technique for facilitating efficient browsing, fast comprehension, and effective retrieval of video content. Certain properties of new video data, such as the lack of a prominent emphasis and fuzzy boundaries in theme development, challenge conventional approaches based on video feature information and introduce new difficulties in extracting depth and breadth features from video. In addition, the diversity of user requirements further complicates accurate keyframe screening. To overcome these challenges, this paper proposes a hierarchical spatial-temporal cross-attention scheme for video summarization based on contrastive learning. A graph attention network (GAT) and a multi-head convolutional attention cell are used to extract local and depth features, while a GAT-adjusted bidirectional ConvLSTM (DB-ConvLSTM) is used to extract global and breadth features. Furthermore, a spatial-temporal cross-attention-based ConvLSTM is developed to merge the hierarchical features and achieve more accurate screening within clusters of similar keyframes. Verification experiments and comparative analysis demonstrate that our method outperforms state-of-the-art methods.
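The fusion step described above, in which one feature stream attends to another, can be illustrated with a minimal scaled dot-product cross-attention sketch. This is not the paper's implementation: the function name, shapes, and the use of plain numpy (rather than the ConvLSTM-integrated attention the paper proposes) are illustrative assumptions; it only shows the generic mechanism by which temporal (global/breadth) features could query spatial (local/depth) features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(temporal_feats, spatial_feats):
    """Illustrative cross-attention: temporal features (queries) attend to
    spatial features (keys/values). Shapes: (T, d) and (S, d)."""
    d_k = spatial_feats.shape[-1]
    scores = temporal_feats @ spatial_feats.T / np.sqrt(d_k)  # (T, S)
    weights = softmax(scores, axis=-1)                        # rows sum to 1
    return weights @ spatial_feats                            # (T, d) fused

rng = np.random.default_rng(0)
temporal = rng.normal(size=(4, 8))  # hypothetical global features, 4 frames
spatial = rng.normal(size=(6, 8))   # hypothetical local features, 6 regions
fused = cross_attention(temporal, spatial)
print(fused.shape)  # (4, 8)
```

In the paper's scheme this attention is hierarchical and embedded in a ConvLSTM so that spatial and temporal dependencies are modeled jointly; the sketch above only conveys the core attention operation.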

Keywords: cross-attention; spatial–temporal features; video summarization.

MeSH terms

  • Algorithms*
  • Image Interpretation, Computer-Assisted* / methods
  • Video Recording / methods