Self-Supervised Video Representation Learning by Video Incoherence Detection

Haozhi Cao; Yuecong Xu; Kezhi Mao; Lihua Xie; Jianxiong Yin; Simon See; Qianwen Xu; Jianfei Yang

doi:10.1109/TCYB.2023.3265393

IEEE Trans Cybern. 2023 Apr 20:PP. doi: 10.1109/TCYB.2023.3265393. Online ahead of print.

PMID: 37079425

Abstract

This article introduces a novel self-supervised method that leverages incoherence detection for video representation learning. It stems from the observation that humans can easily identify incoherence in a video based on a comprehensive understanding of its content. Specifically, we construct an incoherent clip from multiple subclips that are hierarchically sampled from the same raw video, with incoherence of various lengths between them. The network is trained to learn high-level representations by predicting the location and length of the incoherence, given the incoherent clip as input. Additionally, we introduce intravideo contrastive learning to maximize the mutual information between incoherent clips from the same raw video. We evaluate the proposed method through extensive experiments on action recognition and video retrieval using various backbone networks. The experiments show that the proposed method achieves remarkable performance across different backbone networks and datasets compared with previous coherence-based methods.
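
The construction of an incoherent clip and its two prediction targets can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' implementation: it joins only two subclips rather than hierarchically sampling several, and the function name `make_incoherent_clip`, the 16-frame clip length, and the gap range of 3 to 10 frames are all hypothetical choices made for illustration.

```python
import random


def make_incoherent_clip(num_frames, clip_len=16, min_gap=3, max_gap=10, rng=None):
    """Return (frame_indices, location, gap) for one incoherent clip.

    Simplified two-subclip version (an assumption, not the paper's
    hierarchical sampling): frames are consecutive except for one jump
    that skips `gap` frames of the raw video.
    """
    rng = rng or random.Random()
    if num_frames < clip_len + max_gap:
        raise ValueError("raw video too short for the requested clip and gap")

    gap = rng.randint(min_gap, max_gap)   # length of incoherence (skipped frames)
    loc = rng.randint(1, clip_len - 1)    # clip position where the jump occurs
    start = rng.randint(0, num_frames - clip_len - gap)

    # First subclip: `loc` consecutive frames starting at `start`.
    first = list(range(start, start + loc))
    # Second subclip: resumes after skipping `gap` frames.
    second_start = start + loc + gap
    second = list(range(second_start, second_start + clip_len - loc))

    return first + second, loc, gap


if __name__ == "__main__":
    indices, loc, gap = make_incoherent_clip(num_frames=100, rng=random.Random(0))
    print(indices, loc, gap)
```

In a pretext task of this kind, `loc` and `gap` would serve as the self-supervised labels for the location and length predictions, and two clips drawn from the same raw video could form a positive pair for the intravideo contrastive term. How the labels are discretized and how the losses are weighted is not stated in the abstract, so those details are left open here.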