Context-Aware Proposal-Boundary Network With Structural Consistency for Audiovisual Event Localization

IEEE Trans Neural Netw Learn Syst. 2023 Jul 19:PP. doi: 10.1109/TNNLS.2023.3290083. Online ahead of print.

Abstract

Audiovisual event localization aims to localize the event that is both visible and audible in a video. Previous works focus on segment-level encoding of the audio and visual feature sequences and neglect event proposals and boundaries, which are crucial for this task. The event proposal features provide event-internal consistency among the consecutive segments constituting a proposal, while the event boundary features offer event boundary consistency to make segments located at boundaries aware of the event occurrence. In this article, we explore proposal-level feature encoding and propose a novel context-aware proposal-boundary (CAPB) network to address audiovisual event localization. In particular, we design a local-global context encoder (LGCE) to aggregate local-global temporal context information for the visual sequence, audio sequence, event proposals, and event boundaries, respectively. The local context from temporally adjacent segments or proposals contributes to event discrimination, while the global context from the entire video provides semantic guidance on temporal relationships. Furthermore, we enhance the structural consistency between segments by exploiting the above-encoded proposal and boundary representations. CAPB leverages the context information and structural consistency to obtain a context-aware, event-consistent cross-modal representation for accurate event localization. Extensive experiments conducted on the audiovisual event (AVE) dataset show that our approach outperforms the state-of-the-art methods by clear margins in both supervised event localization and cross-modality localization.
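The idea of fusing local context (from temporally adjacent segments) with global context (from the entire video) can be illustrated with a minimal sketch. The abstract does not give implementation details, so the windowed and whole-sequence averaging below is a hypothetical stand-in for the learned encoder; the window size and mixing weight are illustrative assumptions, not values from the paper.

```python
# Sketch of local-global temporal context fusion for a sequence of
# per-segment features. Simple averaging stands in for the learned
# attention of the LGCE; all parameters are illustrative assumptions.

def local_global_context(features, window=1, alpha=0.5):
    """Fuse each segment feature with its local (windowed) and
    global (whole-video) average context.

    features: list of floats (one scalar feature per video segment)
    window:   number of neighboring segments on each side for local context
    alpha:    mixing weight between local and global context
    """
    n = len(features)
    global_ctx = sum(features) / n  # whole-video context
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        local_ctx = sum(features[lo:hi]) / (hi - lo)  # adjacent-segment context
        ctx = alpha * local_ctx + (1 - alpha) * global_ctx
        # Blend the original segment feature with its fused context.
        out.append(0.5 * (features[i] + ctx))
    return out

print(local_global_context([1.0, 2.0, 3.0, 4.0]))
# → [1.5, 2.125, 2.875, 3.5]
```

In an actual model these would be learned attention operations over high-dimensional features for each stream (visual, audio, proposal, boundary); the sketch only conveys how local and global evidence are combined per segment.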