Memory Augmented Deep Recurrent Neural Network for Video Question Answering

IEEE Trans Neural Netw Learn Syst. 2020 Sep;31(9):3159-3167. doi: 10.1109/TNNLS.2019.2938015. Epub 2019 Sep 20.

Abstract

Video question answering (VideoQA) is an important but challenging multimedia task in which a system automatically analyzes questions and videos and generates accurate answers. However, research on VideoQA is still in its infancy. In this article, we propose a novel memory augmented deep recurrent neural network (MA-DRNN) model for VideoQA, which features a new method for encoding videos and questions, and memory augmentation using the emerging differentiable neural computer (DNC). Specifically, we encode textual (question) information before visual (video) information, which leads to better visual-textual representations. Moreover, we leverage the DNC (with an external memory) to store and retrieve useful information in questions and videos and to model the long-term visual-textual dependence. To evaluate the proposed model, we conducted extensive experiments using the VTW data set and the MSVD-QA data set, which are both widely used large-scale video data sets for language-level understanding. The experimental results validate the proposed model and show that it outperforms the state of the art in terms of various accuracy-related metrics.
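The two ideas named in the abstract, encoding the question before the video and reading from an external memory, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not the MA-DRNN architecture: the plain tanh RNN, the shared recurrent weights, the toy dimensions, the random data, and the helper names `rnn_encode` and `content_read` are all hypothetical. The read step shows only a content-based lookup (softmax over cosine similarity), which is one addressing mechanism used by DNC-style memories; write heads, temporal links, and memory allocation are omitted.

```python
import numpy as np

def rnn_encode(inputs, W_in, W_h, h0):
    """Run a plain tanh RNN over `inputs` (T x d_in); return the final hidden state."""
    h = h0
    for x in inputs:
        h = np.tanh(x @ W_in + h @ W_h)
    return h

def content_read(memory, key, beta=1.0):
    """Content-based read: softmax over cosine similarity between `key` and memory rows."""
    eps = 1e-8
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + eps)
    scores = beta * sim
    w = np.exp(scores - scores.max())  # subtract the max for numerical stability
    w = w / w.sum()
    return w @ memory, w

rng = np.random.default_rng(0)
d_q, d_v, d_h, n_slots = 6, 8, 5, 4      # hypothetical toy sizes
question = rng.normal(size=(3, d_q))     # 3 word embeddings (random stand-ins)
video = rng.normal(size=(7, d_v))        # 7 frame features (random stand-ins)

Wq_in = rng.normal(scale=0.1, size=(d_q, d_h))
Wv_in = rng.normal(scale=0.1, size=(d_v, d_h))
W_h = rng.normal(scale=0.1, size=(d_h, d_h))

# Text first: the question's final state seeds the video encoder (an assumed
# way of realizing "question before video"; the abstract does not specify how).
h_question = rnn_encode(question, Wq_in, W_h, np.zeros(d_h))
h_joint = rnn_encode(video, Wv_in, W_h, h_question)

# Query a random, untrained external memory with the joint state.
memory = rng.normal(size=(n_slots, d_h))
read_vector, read_weights = content_read(memory, h_joint)
```

In this sketch the read weights form a probability distribution over memory slots, so the read vector is a convex combination of stored rows; a trained model would learn both the keys and the memory contents rather than sampling them at random.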