Video Moment Localization via Deep Cross-Modal Hashing

IEEE Trans Image Process. 2021;30:4667-4677. doi: 10.1109/TIP.2021.3073867. Epub 2021 May 3.

Abstract

Owing to the rapid growth of surveillance and Web videos, video moment localization, an important branch of video content analysis, has attracted wide attention from both industry and academia in recent years. It is, however, a non-trivial task because of the following challenges: temporal context modeling, intelligent moment candidate generation, and the efficiency and scalability required in practice. To address these impediments, we present a deep end-to-end cross-modal hashing network. To be specific, we first design a video encoder relying on a bidirectional temporal convolutional network to simultaneously generate moment candidates and learn their representations. Because the video encoder characterizes temporal contextual structures over time windows at multiple scales, it yields enhanced moment representations. As its counterpart, we design an independent query encoder for user intention understanding. Thereafter, a cross-modal hashing module is developed to project these two heterogeneous representations into a shared isomorphic Hamming space for compact hash code learning. The relevance score of each "moment-query" pair can then be estimated efficiently via the Hamming distance. Beyond effectiveness, our model is far more efficient and scalable, since the hash codes of videos can be computed offline. Experimental results on real-world datasets demonstrate the superiority of our model over several state-of-the-art competitors.
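
To make the hashing-and-matching step concrete, below is a minimal PyTorch sketch of the general idea, not the authors' implementation: the module names (HashHead, hamming_distance), the feature dimensions, and the tanh/sign relaxation are all assumptions for illustration. It shows how continuous moment and query embeddings can be projected into a shared Hamming space as binary codes, with moment codes computable offline and relevance scored by Hamming distance at query time.

    import torch
    import torch.nn as nn

    class HashHead(nn.Module):
        """Projects a continuous embedding to an n_bits-dimensional code.

        tanh() is a common differentiable surrogate used during training;
        sign() yields the final binary {-1, +1} code for indexing.
        (Names and design here are illustrative, not from the paper.)
        """
        def __init__(self, in_dim, n_bits=64):
            super().__init__()
            self.proj = nn.Linear(in_dim, n_bits)

        def forward(self, x):
            return torch.tanh(self.proj(x))   # relaxed code in (-1, 1)

        def binarize(self, x):
            return torch.sign(self.proj(x))   # hard code in {-1, +1}

    def hamming_distance(a, b):
        # For {-1, +1} codes of length K: d_H(a, b) = (K - a.b) / 2,
        # so a dot product suffices instead of bitwise comparison.
        return (a.size(-1) - (a * b).sum(dim=-1)) / 2

    # Hypothetical usage with stand-in (untrained, random) features.
    moment_feats = torch.randn(100, 512)   # e.g. from a video encoder
    query_feat = torch.randn(1, 256)       # e.g. from a query encoder

    moment_codes = HashHead(512).binarize(moment_feats)  # computable offline
    query_code = HashHead(256).binarize(query_feat)      # computed per query

    dists = hamming_distance(moment_codes, query_code)   # shape: (100,)
    best_moment = dists.argmin()   # smallest distance = most relevant moment

Because the moment codes are binary and fixed once computed, the per-query cost reduces to cheap Hamming comparisons, which is the source of the efficiency and scalability claims above.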