Cross-modality integration framework with prediction, perception and discrimination for video anomaly detection

Neural Netw. 2024 Apr:172:106138. doi: 10.1016/j.neunet.2024.106138. Epub 2024 Jan 19.

Abstract

Video anomaly detection is an important task for public security in the multimedia field. It aims to distinguish events that deviate from normal patterns. As an important semantic representation, textual information can effectively characterize different contents for anomaly detection. However, most existing methods rely primarily on the visual modality, with limited incorporation of the textual modality. In this paper, a cross-modality integration framework (CIForAD) is proposed for anomaly detection, which combines textual and visual modalities for prediction, perception and discrimination. First, a feature fusion prediction (FUP) module is designed to predict target regions by fusing visual features with textual features for prompting, which amplifies the discriminative distance. Then an image-text semantic perception (ISP) module is developed to judge semantic consistency by associating fine-grained visual features with textual features, where a strategy of local training and global inference is introduced to perceive local details and global semantic correlation. Finally, a self-supervised time attention discrimination (TAD) module is built to explore inter-frame relations and further distinguish abnormal sequences from normal ones. Extensive experiments on three challenging benchmarks show that CIForAD achieves state-of-the-art anomaly detection performance.
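To illustrate the general idea behind temporal-attention-based discrimination of abnormal frames, the sketch below is a minimal, hypothetical implementation and is not the paper's TAD module: each frame's feature vector is reconstructed by attending over the other frames of the sequence, and frames that their temporal context cannot explain receive a high anomaly score. The function name, the use of scaled dot-product attention, and the reconstruction-error scoring are all assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_scores(frames):
    """frames: (T, D) array of per-frame feature vectors.
    Returns a (T,) array of anomaly scores.

    Each frame is reconstructed from the OTHER frames via
    scaled dot-product attention; frames that deviate from the
    normal temporal pattern are poorly reconstructed and thus
    receive a higher score. This is an illustrative sketch, not
    the CIForAD TAD module.
    """
    T, D = frames.shape
    attn = frames @ frames.T / np.sqrt(D)  # (T, T) pairwise similarity
    np.fill_diagonal(attn, -np.inf)        # a frame may not attend to itself
    weights = softmax(attn, axis=1)        # attention over the other frames
    recon = weights @ frames               # context-based reconstruction
    return np.linalg.norm(frames - recon, axis=1)
```

For a sequence of near-identical normal frames with one outlier inserted, the outlier gets the largest score, since its neighbors cannot reconstruct it from the shared normal pattern.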

Keywords: Anomaly detection; Frame prediction; Perception; Temporal discrimination.

MeSH terms

  • Benchmarking*
  • Multimedia*
  • Perception
  • Semantics