Learning Human-Object Interaction via Interactive Semantic Reasoning

IEEE Trans Image Process. 2021;30:9294-9305. doi: 10.1109/TIP.2021.3125258. Epub 2021 Nov 12.

Abstract

Human-Object Interaction (HOI) detection aims to learn how humans interact with surrounding objects by inferring 〈human, verb, object〉 triplets. Recent HOI detection methods infer HOIs by directly extracting appearance features and spatial configurations from the related visual targets of human and object, but neglect the powerful interactive semantic reasoning between these targets. Meanwhile, existing methods simply concatenate spatial encodings of visual targets to appearance features, which cannot dynamically promote visual feature learning. To solve these problems, we first present a novel semantic-based Interactive Reasoning Block, in which the interactive semantics implied among visual targets are efficiently exploited. Beyond inferring HOIs from discrete instance features, we then design an HOI Inferring Structure that parses pairwise interactive semantics among visual targets at both the scene-wide and instance-wide levels. Furthermore, we propose a Spatial Guidance Model based on the locations of human body parts and the object, which serves as a geometric guidance to dynamically enhance visual feature learning. Based on the above modules, we construct a framework named Interactive-Net for HOI detection, which is fully differentiable and end-to-end trainable. Extensive experiments show that our proposed framework outperforms existing HOI detection methods on both the V-COCO and HICO-DET benchmarks, improving on the baseline by about 5.9% and 17.7% relative, respectively, validating its efficacy in detecting HOIs.
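The contrast the abstract draws, dynamically enhancing visual features with spatial cues rather than merely concatenating them, can be illustrated with a minimal gating sketch. This is an assumption-laden toy, not the paper's Spatial Guidance Model: all names, dimensions, and the choice of a sigmoid gate are illustrative. A layout vector (e.g., normalized human/object box offsets) is mapped to a multiplicative gate that modulates the appearance feature element-wise.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # Numerically standard logistic; outputs lie strictly in (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def spatial_guidance_gate(app_feat, spatial_feat, W, b):
    """Modulate an appearance feature with a gate derived from spatial layout.

    Unlike concatenation, the spatial cue here rescales each appearance
    dimension, so the spatial configuration directly shapes the visual
    representation. (Hypothetical sketch, not the paper's module.)
    """
    gate = sigmoid(spatial_feat @ W + b)  # same dimensionality as app_feat
    return app_feat * gate

# Illustrative dimensions: 8-D appearance feature, 4-D spatial layout.
D_APP, D_SP = 8, 4
W = rng.normal(size=(D_SP, D_APP)) * 0.1  # learned in practice; random here
b = np.zeros(D_APP)

app = rng.normal(size=D_APP)              # stand-in appearance feature
sp = np.array([0.2, 0.5, 0.3, 0.7])       # stand-in normalized box offsets
out = spatial_guidance_gate(app, sp, W, b)
```

Because the gate lies in (0, 1), each output component is attenuated relative to the raw appearance feature, with the degree of attenuation controlled by the spatial layout; a concatenation baseline, by contrast, leaves the appearance feature itself untouched.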

MeSH terms

  • Algorithms*
  • Humans
  • Semantics*