Collaborative and Multilevel Feature Selection Network for Action Recognition

Zhenxing Zheng; Gaoyun An; Shan Cao; Dapeng Wu; Qiuqi Ruan

doi:10.1109/TNNLS.2021.3105184

IEEE Trans Neural Netw Learn Syst. 2023 Mar;34(3):1304-1318. doi: 10.1109/TNNLS.2021.3105184. Epub 2023 Feb 28.

PMID: 34424850

Abstract

The feature pyramid has been widely used in many visual tasks, such as fine-grained image classification, instance segmentation, and object detection, and has achieved promising performance. Although many algorithms exploit features from different levels to construct the feature pyramid, they usually treat these features equally and do not investigate in depth their inherent complementary advantages. In this article, to learn a pyramid feature with robust representational ability for action recognition, we propose a novel collaborative and multilevel feature selection network (FSNet) that applies feature selection and aggregation to multilevel features according to action context. Unlike previous works that learn the pattern of frame appearance by enhancing spatial encoding, the proposed network consists of a position selection module and a channel selection module that adaptively aggregate multilevel features into a new informative feature along both the position and channel dimensions. The position selection module integrates the vectors at the same spatial location across multilevel features with positionwise attention. Similarly, the channel selection module selectively aggregates the channel maps at the same channel location across multilevel features with channelwise attention. Positionwise features with different receptive fields and channelwise features with different pattern-specific responses are each emphasized according to their correlation with actions, and the two are fused into a new informative feature for action recognition. The proposed FSNet can be inserted flexibly into different backbone networks, and extensive experiments are conducted on three benchmark action datasets: Kinetics, UCF101, and HMDB51. Experimental results show that FSNet is practical and can be collaboratively trained to boost the representational ability of existing networks. FSNet outperforms most top-tier models on Kinetics and all models on UCF101 and HMDB51.
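
The two selection modules described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the abstract does not give the attention formulas, so the sketch assumes that the multilevel features have already been resized to a common shape (levels, channels, height, width), that attention weights come from a softmax taken across levels, and that the two outputs are fused by simple addition. The function names and the scoring vectors `w_pos` and `w_ch` are hypothetical stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_selection(feats, w_pos):
    # feats: (L, C, H, W) multilevel features, assumed pre-aligned in shape.
    # w_pos: (C,) hypothetical scoring vector (would be learned in practice).
    # One score per level and spatial location, normalized across levels.
    scores = np.einsum("lchw,c->lhw", feats, w_pos)
    attn = softmax(scores, axis=0)                 # (L, H, W)
    # Weighted sum over levels of the vectors at each spatial location.
    return (attn[:, None] * feats).sum(axis=0)     # (C, H, W)

def channel_selection(feats, w_ch):
    # w_ch: (C,) hypothetical per-channel scale (would be learned in practice).
    # Global average pooling gives one descriptor per level and channel.
    pooled = feats.mean(axis=(2, 3))               # (L, C)
    attn = softmax(pooled * w_ch, axis=0)          # (L, C), normalized across levels
    # Weighted sum over levels of the channel maps at each channel index.
    return (attn[:, :, None, None] * feats).sum(axis=0)  # (C, H, W)

def fs_fuse(feats, w_pos, w_ch):
    # Assumed fusion: elementwise sum of the two selected features.
    return position_selection(feats, w_pos) + channel_selection(feats, w_ch)
```

Because each module takes a convex combination across levels, zero scoring vectors reduce both to a plain average of the levels, which corresponds to the "treat them equally" baseline that the abstract argues against.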