Few-shot human-object interaction video recognition with transformers

Neural Netw. 2023 Jun:163:1-9. doi: 10.1016/j.neunet.2023.01.019. Epub 2023 Feb 10.

Abstract

We propose a novel few-shot learning framework that recognizes human-object interaction (HOI) classes from only a few labeled samples. We achieve this by leveraging a meta-learning paradigm in which human-object interactions are embedded into compact features for similarity calculation. More specifically, the spatial and temporal relationships of HOIs in videos are modeled with transformers, which significantly boosts performance over the baseline. First, we present a spatial encoder that extracts the spatial context and infers frame-level features of the human and objects in each frame. A video-level feature is then obtained by encoding the sequence of frame-level feature vectors with a temporal encoder. Experiments on two datasets, CAD-120 and Something-Else, validate that our approach achieves accuracy improvements of 7.8% and 15.2% on the 1-shot task and 4.7% and 15.7% on the 5-shot task, outperforming state-of-the-art methods.
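
The abstract outlines a two-stage encoder (spatial attention over per-frame human/object features, then temporal attention over frame-level features) whose output is matched against labeled support samples. Below is a minimal sketch of that pipeline, assuming PyTorch; the dimensions, module names, mean-pooling, and the prototype-style cosine-similarity head are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HOIFewShotEncoder(nn.Module):
    """Spatial encoder over human/object tokens, then temporal encoder
    over frame-level features, yielding a compact video-level embedding."""
    def __init__(self, feat_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        spatial_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        # Spatial encoder: attends over the human/object tokens of one frame.
        self.spatial = nn.TransformerEncoder(spatial_layer, n_layers)
        # Temporal encoder: attends over the sequence of frame-level features.
        self.temporal = nn.TransformerEncoder(temporal_layer, n_layers)

    def forward(self, tokens):
        # tokens: (batch, frames, entities, feat_dim) -- per-frame appearance
        # features of the human and objects (e.g., from a detector backbone).
        b, t, e, d = tokens.shape
        frame_ctx = self.spatial(tokens.reshape(b * t, e, d))
        frame_feat = frame_ctx.mean(dim=1).reshape(b, t, d)  # frame-level
        video_ctx = self.temporal(frame_feat)
        return video_ctx.mean(dim=1)                         # video-level

def few_shot_logits(encoder, support, support_labels, query, n_way):
    """Similarity calculation against class prototypes (an assumed metric
    head; the paper's exact matching function may differ)."""
    s_emb = F.normalize(encoder(support), dim=-1)
    q_emb = F.normalize(encoder(query), dim=-1)
    protos = torch.stack([s_emb[support_labels == c].mean(0)
                          for c in range(n_way)])
    return q_emb @ protos.t()  # cosine similarity to each class prototype

# Toy 5-way 1-shot episode with random features.
enc = HOIFewShotEncoder()
support = torch.randn(5, 8, 3, 256)   # 5 videos, 8 frames, 3 entities each
labels = torch.arange(5)
query = torch.randn(10, 8, 3, 256)
print(few_shot_logits(enc, support, labels, query, n_way=5).shape)  # (10, 5)
```

Mean pooling is used here only as the simplest way to collapse entity and frame dimensions; a learned class token or attention pooling would be a natural substitute within the same two-encoder structure.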

Keywords: Few-shot learning; Human–object interaction recognition; Meta-learning; Transformers.

MeSH terms

  • Humans
  • Learning*
  • Visual Perception*