Video Object Segmentation Using Kernelized Memory Network With Multiple Kernels

IEEE Trans Pattern Anal Mach Intell. 2023 Feb;45(2):2595-2612. doi: 10.1109/TPAMI.2022.3163375. Epub 2023 Jan 6.

Abstract

Semi-supervised video object segmentation (VOS) is to predict the segment of a target object in a video when a ground truth segmentation mask for the target is given in the first frame. Recently, space-time memory networks (STM) have received significant attention as a promising approach for semi-supervised VOS. However, an important point has been overlooked in applying STM to VOS: The solution (=STM) is non-local, but the problem (=VOS) is predominantly local. To solve this mismatch between STM and VOS, we propose new VOS networks called kernelized memory network (KMN) and KMN with multiple kernels (KMN M). Our networks conduct not only Query-to-Memory matching but also Memory-to-Query matching. In Memory-to-Query matching, a kernel is employed to reduce the degree of non-localness of the STM. In addition, we present a Hide-and-Seek strategy in pre-training to handle occlusions effectively. The proposed networks surpass the state-of-the-art results on standard benchmarks by a significant margin (+4% in JM on DAVIS 2017 test-dev set). The runtimes of our proposed KMN and KMN M on DAVIS 2016 validation set are 0.12 and 0.13 seconds per frame, respectively, and the two networks have similar computation times to STM.