Seeking a Hierarchical Prototype for Multimodal Gesture Recognition

IEEE Trans Neural Netw Learn Syst. 2023 Jul 26:PP. doi: 10.1109/TNNLS.2023.3295811. Online ahead of print.

Abstract

Gesture recognition has drawn considerable attention owing to its wide range of applications. Although significant progress has been made in this field, previous works have largely focused on distinguishing between gesture classes while ignoring the intra-class divergence caused by gesture-irrelevant factors. Meanwhile, for multimodal gesture recognition, feature or score fusion in the final stage is the common choice for combining the information of different modalities. Consequently, the gesture-relevant features in different modalities may be redundant, whereas the complementarity of the modalities is not exploited sufficiently. To address these problems, in this article we propose a hierarchical gesture prototype framework that highlights gesture-relevant features such as poses and motions. The framework consists of a sample-level prototype and a modal-level prototype. The sample-level gesture prototype is established with a memory-bank structure, which avoids the distraction of gesture-irrelevant factors in each sample, such as illumination, background, and the performers' appearances. The modal-level prototype is then obtained via a generative adversarial network (GAN)-based subnetwork, in which modal-invariant features are extracted and pulled together. Meanwhile, the modal-specific attribute features are used to synthesize the features of the other modalities, and this circulation of modality information helps to leverage their complementarity. Extensive experiments on three widely used gesture datasets demonstrate that our method effectively highlights gesture-relevant features and outperforms state-of-the-art methods.
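The paper does not include implementation details in the abstract, but the sample-level prototype memory bank can be sketched in a minimal form. The sketch below is an assumption of one plausible realization (all names are hypothetical): each gesture class keeps a running prototype vector updated by an exponential moving average, so that gesture-irrelevant variation such as illumination, background, and performer appearance averages out across samples.

```python
import numpy as np


class GesturePrototypeBank:
    """Hypothetical sketch of a sample-level prototype memory bank.

    One prototype vector per gesture class, updated with an exponential
    moving average (EMA). This is an illustrative guess at the mechanism,
    not the authors' actual implementation.
    """

    def __init__(self, num_classes: int, feat_dim: int, momentum: float = 0.9):
        self.momentum = momentum
        self.bank = np.zeros((num_classes, feat_dim))
        self.initialized = np.zeros(num_classes, dtype=bool)

    def update(self, features: np.ndarray, labels: np.ndarray) -> None:
        # Pull each class prototype toward the new sample's feature.
        for feat, cls in zip(features, labels):
            if not self.initialized[cls]:
                self.bank[cls] = feat
                self.initialized[cls] = True
            else:
                self.bank[cls] = (self.momentum * self.bank[cls]
                                  + (1.0 - self.momentum) * feat)

    def prototype(self, cls: int) -> np.ndarray:
        # L2-normalized prototype, usable as a gesture-relevant target.
        v = self.bank[cls]
        n = np.linalg.norm(v)
        return v / n if n > 0 else v
```

In training, per-sample features would be written into the bank each iteration, and a loss would pull each sample's feature toward its class prototype, suppressing sample-specific nuisance factors.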