Semantic Representation and Attention Alignment for Graph Information Bottleneck in Video Summarization

IEEE Trans Image Process. 2023:32:4170-4184. doi: 10.1109/TIP.2023.3293762. Epub 2023 Jul 20.

Abstract

End-to-end Long Short-Term Memory (LSTM) models have been successfully applied to video summarization. However, a weakness of the LSTM model, namely poor generalization caused by inefficient representation learning of input nodes, limits its ability to carry out node classification in user-created videos. Given the power of Graph Neural Networks (GNNs) in representation learning, we adopt the Graph Information Bottleneck (GIB) to develop a Contextual Feature Transformation (CFT) mechanism that refines the temporal dual features, yielding a semantic representation with attention alignment. Furthermore, a novel Salient-Area-Size-based spatial attention model is presented to extract frame-wise visual features, based on the observation that humans tend to focus on large and moving objects. Lastly, the semantic representation is embedded with attention alignment in the end-to-end LSTM framework to differentiate otherwise indistinguishable images. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art (SOTA) methods.
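The Salient-Area-Size-based spatial attention idea can be illustrated with a minimal sketch: pool per-region visual features into a frame-level feature, weighting each region by its salient-area size so that larger regions dominate. This is an assumption-laden toy illustration, not the authors' actual model; the function name `salient_area_attention`, the log-area softmax weighting, and the inputs are all hypothetical choices for exposition.

```python
import numpy as np

def salient_area_attention(region_feats, region_areas):
    """Pool per-region features into one frame-level feature,
    giving larger salient regions higher attention weight.

    region_feats: (R, D) array, one feature row per detected region.
    region_areas: length-R sequence of region sizes (e.g. pixel counts).
    Returns a (D,) weighted average of the region features.
    """
    feats = np.asarray(region_feats, dtype=float)
    log_areas = np.log1p(np.asarray(region_areas, dtype=float))
    # Softmax over log-areas: bigger regions get larger, but bounded, weight.
    w = np.exp(log_areas - log_areas.max())
    w /= w.sum()
    return w @ feats

# Toy example: two one-hot region features; the 900-pixel region
# contributes far more to the pooled frame feature than the 100-pixel one.
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
pooled = salient_area_attention(feats, [900.0, 100.0])
```

Because the attention weights form a convex combination, the pooled feature stays on the same scale as the region features regardless of how many regions a frame contains.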