Few-shot human-object interaction video recognition with transformers

Neural Netw. 2023 Jun:163:1-9. doi: 10.1016/j.neunet.2023.01.019. Epub 2023 Feb 10.

Abstract

We propose a novel few-shot learning framework that recognizes human-object interaction (HOI) classes from only a few labeled samples. We achieve this by leveraging a meta-learning paradigm in which human-object interactions are embedded into compact features for similarity calculation. More specifically, the spatial and temporal relationships of HOIs in videos are modeled with transformers, which significantly boosts performance over the baseline. First, we present a spatial encoder that extracts the spatial context and infers frame-level features of the human and objects in each frame. A video-level feature is then obtained by encoding the sequence of frame-level feature vectors with a temporal encoder. Experiments on two datasets, CAD-120 and Something-Else, validate that our approach achieves accuracy improvements of 7.8% and 15.2% on the 1-shot task and 4.7% and 15.7% on the 5-shot task, outperforming state-of-the-art methods.
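
The abstract outlines a two-stage encoder (spatial attention over per-frame human/object features, then temporal attention over frame-level features) whose output is matched against labeled support samples. Below is a minimal sketch of that pipeline, assuming PyTorch; the dimensions, module names, mean-pooling, and the prototype-style cosine-similarity head are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HOIFewShotEncoder(nn.Module):
    """Spatial encoder over human/object tokens, then temporal encoder
    over frame-level features, yielding a compact video-level embedding."""
    def __init__(self, feat_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        spatial_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        # Spatial encoder: attends over the human/object tokens of one frame.
        self.spatial = nn.TransformerEncoder(spatial_layer, n_layers)
        # Temporal encoder: attends over the sequence of frame-level features.
        self.temporal = nn.TransformerEncoder(temporal_layer, n_layers)

    def forward(self, tokens):
        # tokens: (batch, frames, entities, feat_dim) -- per-frame appearance
        # features of the human and objects (e.g., from a detector backbone).
        b, t, e, d = tokens.shape
        frame_ctx = self.spatial(tokens.reshape(b * t, e, d))
        frame_feat = frame_ctx.mean(dim=1).reshape(b, t, d)  # frame-level
        video_ctx = self.temporal(frame_feat)
        return video_ctx.mean(dim=1)                         # video-level

def few_shot_logits(encoder, support, support_labels, query, n_way):
    """Similarity calculation against class prototypes (an assumed metric
    head; the paper's exact matching function may differ)."""
    s_emb = F.normalize(encoder(support), dim=-1)
    q_emb = F.normalize(encoder(query), dim=-1)
    protos = torch.stack([s_emb[support_labels == c].mean(0)
                          for c in range(n_way)])
    return q_emb @ protos.t()  # cosine similarity to each class prototype

# Toy 5-way 1-shot episode with random features.
enc = HOIFewShotEncoder()
support = torch.randn(5, 8, 3, 256)   # 5 videos, 8 frames, 3 entities each
labels = torch.arange(5)
query = torch.randn(10, 8, 3, 256)
print(few_shot_logits(enc, support, labels, query, n_way=5).shape)  # (10, 5)
```

Mean pooling is used here only as the simplest way to collapse entity and frame dimensions; a learned class token or attention pooling would be a natural substitute within the same two-encoder structure.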

Keywords: Few-shot learning; Human–object interaction recognition; Meta-learning; Transformers.

MeSH terms

  • Humans
  • Learning*
  • Visual Perception*