Exploring Language Hierarchy for Video Grounding

IEEE Trans Image Process. 2022;31:4693-4706. doi: 10.1109/TIP.2022.3187288. Epub 2022 Jul 12.

Abstract

Understanding language plays a key role in video grounding, where a target moment is localized according to a text query. From a biological point of view, language is naturally hierarchical: the main clause (predicate phrase) provides coarse semantics, while modifiers provide detailed descriptions. In video grounding, moments described by the main clause may exist in multiple clips of a long video, including both the ground-truth clip and background clips. In order to correctly discriminate the ground-truth clip from the background ones, this co-existence therefore leads the model to neglect the main clause and concentrate on the modifiers, which provide the discriminative information for distinguishing the target proposal from the others. We first demonstrate this phenomenon empirically, and then propose a Hierarchical Language Network (HLN) that exploits the language hierarchy, together with a new learning approach called Multi-Instance Positive-Unlabelled Learning (MI-PUL), to alleviate the above problem. Specifically, in HLN, localization is performed on various layers of the language hierarchy, so that attention is paid to different parts of the sentence rather than only the discriminative ones. Furthermore, MI-PUL allows the model to localize background clips that may also be described by the main clause, even without manual annotations. The union of the two proposed components thus enhances the learning of the main clause, which is of critical importance in video grounding. Finally, we show that the proposed HLN can be plugged into current methods to improve their performance. Extensive experiments on challenging datasets show that HLN significantly improves state-of-the-art methods, notably achieving a 6.15% gain in terms of [Formula: see text] on the TACoS dataset.
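In MI-PUL, annotated ground-truth clips act as labelled positives, while the remaining clips are unlabelled: some of them may also match the main clause but carry no annotation. As a rough illustration of the positive-unlabelled idea (not the paper's exact formulation), the sketch below uses a non-negative PU risk estimator in the style of Kiryo et al.; the sigmoid surrogate loss, the function names, and the class prior `prior` are all illustrative assumptions:

```python
import numpy as np

def sigmoid_loss(scores, label):
    """Surrogate loss l(z, y) = 1 / (1 + exp(y * z)); small when
    the sign of the score z agrees with the label y in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(label * scores))

def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk: labelled clips are positives, all other
    clips are unlabelled (a mixture of hidden positives and negatives).
    `prior` is the assumed fraction of positives among unlabelled clips."""
    r_p_pos = sigmoid_loss(pos_scores, +1).mean()  # positives scored as positives
    r_p_neg = sigmoid_loss(pos_scores, -1).mean()  # positives scored as negatives
    r_u_neg = sigmoid_loss(unl_scores, -1).mean()  # unlabelled scored as negatives
    # Correct the negative risk for hidden positives; clamp at zero so the
    # estimator cannot go negative when the correction overshoots.
    neg_risk = max(0.0, r_u_neg - prior * r_p_neg)
    return prior * r_p_pos + neg_risk

# Toy example: two annotated clips scored high, three unlabelled clips.
pos = np.array([3.0, 2.5])
unl = np.array([-2.0, 1.0, -1.5])
risk = nn_pu_risk(pos, unl, prior=0.3)
```

Minimizing such a risk lets the model assign high scores to unannotated background clips that the main clause also describes, instead of forcing them all to be negatives.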

MeSH terms

  • Attention
  • Language*
  • Semantics*