Discriminative Multi-View Dynamic Image Fusion for Cross-View 3-D Action Recognition

IEEE Trans Neural Netw Learn Syst. 2022 Oct;33(10):5332-5345. doi: 10.1109/TNNLS.2021.3070179. Epub 2022 Oct 5.

Abstract

Dramatic imaging viewpoint variation is the critical challenge toward action recognition for depth video. To address this, one feasible way is to enhance view-tolerance of visual feature, while still maintaining strong discriminative capacity. Multi-view dynamic image (MVDI) is the most recently proposed 3-D action representation manner that is able to compactly encode human motion information and 3-D visual clue well. However, it is still view-sensitive. To leverage its performance, a discriminative MVDI fusion method is proposed by us via multi-instance learning (MIL). Specifically, the dynamic images (DIs) from different observation viewpoints are regarded as the instances for 3-D action characterization. After being encoded using Fisher vector (FV), they are then aggregated by sum-pooling to yield the representative 3-D action signature. Our insight is that viewpoint aggregation helps to enhance view-tolerance. And, FV can map the raw DI feature to the higher dimensional feature space to promote the discriminative power. Meanwhile, a discriminative viewpoint instance discovery method is also proposed to discard the viewpoint instances unfavorable for action characterization. The wide-range experiments on five data sets demonstrate that our proposition can significantly enhance the performance of cross-view 3-D action recognition. And, it is also applicable to cross-view 3-D object recognition. The source code is available at https://github.com/3huo/ActionView.