Multi-Hierarchical Category Supervision for Weakly-Supervised Temporal Action Localization

IEEE Trans Image Process. 2021:30:9332-9344. doi: 10.1109/TIP.2021.3124671. Epub 2021 Nov 12.

Abstract

Weakly Supervised Temporal Action Localization (WTAL) aims to localize action segments in untrimmed videos with only video-level category labels in the training phase. In WTAL, an action generally consists of a series of sub-actions, and different categories of actions may share the common sub-actions. However, to distinguish different categories of actions with only video-level class labels, current WTAL models tend to focus on discriminative sub-actions of the action, while ignoring those common sub-actions shared with different categories of actions. This negligence of common sub-actions would lead to the located action segments incomplete, i.e., only containing discriminative sub-actions. Different from current approaches of designing complex network architectures to explore more complete actions, in this paper, we introduce a novel supervision method named multi-hierarchical category supervision (MHCS) to find more sub-actions rather than only the discriminative ones. Specifically, action categories sharing similar sub-actions will be constructed as super-classes through hierarchical clustering. Hence, training with the new generated super-classes would encourage the model to pay more attention to the common sub-actions, which are ignored training with the original classes. Furthermore, our proposed MHCS is model-agnostic and non-intrusive, which can be directly applied to existing methods without changing their structures. Through extensive experiments, we verify that our supervision method can improve the performance of four state-of-the-art WTAL methods on three public datasets: THUMOS14, ActivityNet1.2, and ActivityNet1.3.