Unsupervised Video Anomaly Detection Based on Similarity with Predefined Text Descriptions

Sensors (Basel). 2023 Jul 9;23(14):6256. doi: 10.3390/s23146256.

Abstract

Research on video anomaly detection has mainly been based on video data. However, in many real-world cases, users can conceive of the potential normal and abnormal situations within the anomaly detection domain. This domain knowledge can be conveniently expressed as text descriptions, such as "walking" or "people fighting", which can be easily obtained, customized for specific applications, and applied to unseen abnormal videos not included in the training dataset. We explore the potential of using these text descriptions with unlabeled video datasets. We use large language models to obtain text descriptions and leverage them to detect abnormal frames by calculating the cosine similarity between the input frame and text descriptions using the CLIP visual language model. To enhance performance, we refine the CLIP-derived cosine similarity using an unlabeled dataset and the proposed text-conditional similarity, a similarity measure between two vectors based on additional learnable parameters and a triplet loss. The proposed method has a simple training and inference process that avoids the computationally intensive analysis of optical flow or multiple frames. The experimental results demonstrate that the proposed method outperforms unsupervised methods, achieving 8% and 13% higher AUC scores on the ShanghaiTech and UCF-Crime datasets, respectively. Although the proposed method scores 6% and 5% lower than weakly supervised methods on those datasets, on abnormal videos it achieves 17% and 5% higher AUC scores, indicating that it is comparable with weakly supervised methods that require resource-intensive dataset labeling. These outcomes validate the potential of using text descriptions in unsupervised video anomaly detection.
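The core scoring idea described above can be illustrated with a minimal NumPy sketch. In practice the frame and text embeddings would come from CLIP's image and text encoders; here they are toy vectors, and the softmax aggregation with a temperature parameter is an illustrative assumption, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two 2-D arrays of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def anomaly_score(frame_emb, normal_embs, abnormal_embs, temperature=0.07):
    """Score a frame by comparing it to normal and abnormal text embeddings.

    frame_emb:     1-D frame embedding (would come from CLIP's image encoder).
    normal_embs:   2-D array of embeddings of normal text descriptions.
    abnormal_embs: 2-D array of embeddings of abnormal text descriptions.
    Returns the total softmax probability mass on the abnormal descriptions;
    the temperature value is a hypothetical choice for this sketch.
    """
    text_embs = np.vstack([normal_embs, abnormal_embs])
    sims = cosine_sim(frame_emb[None, :], text_embs)[0]
    probs = np.exp(sims / temperature)
    probs /= probs.sum()
    return probs[len(normal_embs):].sum()
```

A frame whose embedding lies closer to the abnormal descriptions than to the normal ones receives a score near 1, and can then be flagged by thresholding.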

Keywords: CLIP; abnormal video; embedding space; fine-tuning of pre-trained models; large language models; large vision and language models; similarity measure; text descriptions; unsupervised video anomaly detection.