M²DCapsN: Multimodal, Multichannel, and Dual-Step Capsule Network for Natural Language Moment Localization

IEEE Trans Neural Netw Learn Syst. 2023 Apr 7:PP. doi: 10.1109/TNNLS.2023.3261927. Online ahead of print.

Abstract

Natural language moment localization aims to localize the target moment that matches a given natural language query in an untrimmed video. The key to this challenging task is to capture fine-grained video-language correlations so that the query can be aligned with the target moment. Most existing works establish a single-pass interaction schema to capture correlations between queries and moments. Given the complex feature space of lengthy videos and the diverse information across frames, the weight distribution of the information interaction flow is prone to dispersion or misalignment, which introduces redundant information flow that degrades the final prediction. We address this issue by proposing a capsule-based approach to model query-video interactions, termed the Multimodal, Multichannel, and Dual-Step Capsule Network (M²DCapsN), which is derived from the intuition that "multiple people viewing multiple times is better than one person viewing one time." First, we introduce a multimodal capsule network, replacing the single-pass interaction schema of "one person viewing one time" with the iterative interaction schema of "one person viewing multiple times," which cyclically updates cross-modal interactions and corrects potentially redundant interactions via its routing-by-agreement. Then, considering that the conventional routing mechanism only learns a single iterative interaction schema, we further propose a multichannel dynamic routing mechanism to learn multiple iterative interaction schemas, where each channel performs independent routing iterations to collectively capture cross-modal correlations from multiple subspaces, that is, "multiple people viewing." Moreover, we design a dual-step capsule network structure built on the multimodal, multichannel capsule network, bringing together the query and the query-guided key moments to jointly enhance the original video, so that the target moment can be selected from the enhanced parts. Experimental results on three public datasets demonstrate the superiority of our approach over state-of-the-art methods, and comprehensive ablation and visualization analyses validate the effectiveness of each component of the proposed model.
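To make the multichannel routing idea concrete, the following is a minimal illustrative sketch (not the authors' code) of dynamic routing-by-agreement run over several independent channels, with the channel outputs then fused. All tensor shapes, function names, and hyperparameters here are assumptions for illustration only; the paper's actual capsule formulation and cross-modal prediction vectors may differ.

```python
# Illustrative sketch (assumption, not the paper's implementation):
# routing-by-agreement executed independently per channel, then fused,
# so each channel can capture correlations in its own subspace.
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    """Standard capsule squashing non-linearity."""
    norm_sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (norm_sq / (1.0 + norm_sq)) * s / torch.sqrt(norm_sq + eps)

def multichannel_routing(u_hat, num_iters=3):
    """u_hat: prediction vectors of shape (channels, in_caps, out_caps, dim).
    Each channel runs its own routing iterations; channel outputs are
    averaged to aggregate correlations from multiple subspaces."""
    C, N_in, N_out, D = u_hat.shape
    b = torch.zeros(C, N_in, N_out, device=u_hat.device)   # routing logits per channel
    for _ in range(num_iters):
        c = F.softmax(b, dim=-1)                            # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)            # weighted sum -> (C, out_caps, dim)
        v = squash(s)                                       # output capsules per channel
        # agreement update: raise logits where predictions agree with outputs
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
    return v.mean(dim=0)                                    # fuse the channels

# Toy usage: 4 channels, 16 input capsules, 8 output capsules, 32-dim poses.
v = multichannel_routing(torch.randn(4, 16, 8, 32))
print(v.shape)  # torch.Size([8, 32])
```

In this reading, a single-channel run of the loop corresponds to the "one person viewing multiple times" schema, while running several channels in parallel corresponds to "multiple people viewing"; the fusion step is a simplifying assumption here.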