GSSTU: Generative Spatial Self-Attention Transformer Unit for Enhanced Video Prediction

IEEE Trans Neural Netw Learn Syst. 2024 Feb 27:PP. doi: 10.1109/TNNLS.2024.3359716. Online ahead of print.

Abstract

Future frame prediction is a challenging task in computer vision with practical applications in areas such as video generation, autonomous driving, and robotics. Traditional recurrent neural networks have limited effectiveness in capturing long-range dependencies between frames, and combining convolutional neural networks (CNNs) with recurrent networks still falls short in modeling complex dependencies. Generative adversarial networks have shown promising results, but they are computationally expensive and suffer from instability during training. In this article, we propose a novel approach for future frame prediction that combines the encoding capabilities of 3-D CNNs with the sequence modeling capabilities of Transformers. We also propose a spatial self-attention mechanism and a novel neighborhood pixel intensity loss to preserve structural information and local intensity, respectively. Our approach outperforms existing methods in terms of structural similarity (SSIM), peak signal-to-noise ratio (PSNR), and learned perceptual image patch similarity (LPIPS) scores on five public datasets. More precisely, our model achieves average improvements of 4.64%, 18.5%, and 42% in SSIM, PSNR, and LPIPS, respectively, over the second-best method across all datasets. The results demonstrate the effectiveness of our proposed method in generating high-quality predictions of future frames.
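The abstract does not give implementation details of the proposed spatial self-attention mechanism, so the following is only a minimal sketch of a generic spatial self-attention block over CNN feature maps, written under the assumption of a PyTorch-style encoder producing (B, C, H, W) features; the class name, shapes, and reduction factor are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class SpatialSelfAttention(nn.Module):
    """Hypothetical sketch: self-attention across spatial positions of a
    (B, C, H, W) feature map, so each location can aggregate context from
    the whole frame. Not the paper's exact GSSTU module."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)          # (B, HW, C/r)
        k = self.key(x).flatten(2)                             # (B, C/r, HW)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        v = self.value(x).flatten(2).transpose(1, 2)           # (B, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)   # back to (B, C, H, W)
        return x + self.gamma * out  # residual connection keeps the original features


# Usage example on a dummy feature map
feats = torch.randn(2, 64, 16, 16)
block = SpatialSelfAttention(channels=64)
print(block(feats).shape)  # torch.Size([2, 64, 16, 16])
```

Such a block would typically sit between the 3-D CNN encoder and the Transformer sequence model; the scaled dot-product form and the residual gating are standard design choices and stand in here only for whatever formulation the paper actually uses.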