Learning Temporal Dynamics for Video Super-Resolution: A Deep Learning Approach

IEEE Trans Image Process. 2018 Mar 30. doi: 10.1109/TIP.2018.2820807. Online ahead of print.

Abstract

Video super-resolution (SR) aims to estimate a high-resolution (HR) video sequence from a low-resolution (LR) one. Deep learning has been successfully applied to single-image SR, demonstrating the strong capability of neural networks for modeling spatial relations within a single image. The key challenge in video SR is therefore how to efficiently and effectively exploit the temporal dependency among consecutive LR frames in addition to the spatial relations. This remains difficult because complex motion is hard to model and can degrade results if not handled properly. We tackle the problem of learning temporal dynamics from two aspects. First, we propose a temporal adaptive neural network that can adaptively determine the optimal scale of temporal dependency. Inspired by the Inception module in GoogLeNet [1], filters of various temporal scales are applied to the input LR sequence before their responses are adaptively aggregated, in order to fully exploit the temporal relation among consecutive LR frames. Second, we reduce the complexity of motion among neighboring frames using a spatial alignment network that can be trained end-to-end with the temporal adaptive network, and that is more robust to complex motion and more efficient than competing image alignment methods. We provide a comprehensive evaluation of the temporal adaptation and spatial alignment modules. We show that the temporal adaptive design considerably improves SR quality over its plain counterparts, and that the spatial alignment network attains SR performance comparable to a sophisticated optical-flow-based approach while requiring much less running time. Overall, our proposed model with learned temporal dynamics is shown to achieve state-of-the-art SR results in terms of both spatial consistency and temporal coherence on public video datasets. More information can be found in.
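
The temporal adaptive idea described above (several branches, each seeing a different number of consecutive frames, whose responses are then adaptively aggregated) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each branch can be stood in for by a simple average over its temporal window (a real model would use a learned SR network per branch), and that aggregation is a per-pixel softmax-weighted sum. The names `branch_response`, `aggregate`, and `temporal_adaptive`, and the window sizes 1, 3, and 5, are hypothetical and chosen only for illustration.

```python
import math


def branch_response(frames, window):
    """Hypothetical stand-in for one SR branch: average the `window`
    frames centered on the middle frame of the sequence."""
    center = len(frames) // 2
    half = window // 2
    selected = frames[center - half:center + half + 1]
    height, width = len(selected[0]), len(selected[0][0])
    return [
        [sum(f[y][x] for f in selected) / len(selected) for x in range(width)]
        for y in range(height)
    ]


def softmax(values):
    """Numerically stable softmax over a short list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(responses, logits):
    """Per-pixel weighted sum of branch responses. The weights are a softmax
    over branches (an assumption; in a trained model the logits would be
    predicted from the input rather than supplied by hand)."""
    height, width = len(responses[0]), len(responses[0][0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weights = softmax([branch[y][x] for branch in logits])
            out[y][x] = sum(w * r[y][x] for w, r in zip(weights, responses))
    return out


def temporal_adaptive(frames, logits, windows=(1, 3, 5)):
    """Run one branch per temporal window and aggregate their responses."""
    responses = [branch_response(frames, w) for w in windows]
    return aggregate(responses, logits)


if __name__ == "__main__":
    # Five 1x1 "frames"; only the center frame carries signal.
    frames = [[[0.0]], [[0.0]], [[6.0]], [[0.0]], [[0.0]]]
    equal_logits = [[[0.0]], [[0.0]], [[0.0]]]
    print(temporal_adaptive(frames, equal_logits))
```

With equal logits every branch contributes equally at each pixel. Raising one branch's logit at a pixel shifts the output toward that branch's temporal scale there, which is the sense in which the temporal scale is chosen adaptively in this sketch.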