Prototypical Matching Networks for Video Object Segmentation

IEEE Trans Image Process. 2023;32:5623-5636. doi: 10.1109/TIP.2023.3321462. Epub 2023 Oct 17.

Abstract

Semi-supervised video object segmentation is the task of segmenting the target in sequential frames given the ground-truth mask in the first frame. Modern approaches usually use this mask as pixel-level supervision and typically rely on pixel-to-pixel matching between the reference frame and the current frame. However, matching at the pixel level overlooks high-level information beyond local areas and often suffers from confusion caused by similar local appearances. In this paper, we present Prototypical Matching Networks (PMNet), a novel architecture that integrates prototypes into matching-based video object segmentation frameworks as high-level supervision. Specifically, PMNet first divides the foreground and background areas into several parts according to their similarity to global prototypes. Part-level prototypes and instance-level prototypes are generated by encapsulating the semantic information of identical parts and identical instances, respectively. To model the correlation between prototypes, the prototype representations are propagated to one another by reasoning on a graph structure. PMNet then stores both the pixel-level features and the prototypes in a memory bank as target cues. Three affinities, i.e., pixel-to-pixel affinity, prototype-to-pixel affinity, and prototype-to-prototype affinity, are derived to measure the similarity between the query frame and the features in the memory bank. The features aggregated from the memory bank using these affinities provide powerful discrimination from both the pixel-level and the prototype-level perspectives. Extensive experiments on four benchmarks demonstrate superior results compared with state-of-the-art video object segmentation techniques.
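To make the matching pipeline concrete, below is a minimal PyTorch-style sketch of the prototype construction, graph propagation, and the three affinities described above. It is an illustrative reconstruction, not the authors' released implementation: the function names (part_prototypes, propagate_prototypes, read_memory), the soft-assignment pooling, the GCN-style residual update, and the scaled dot-product form of the affinities are all assumptions made for exposition.

    import torch
    import torch.nn.functional as F

    def part_prototypes(feat, mask, global_protos):
        """Soft-assign masked pixels to K global prototypes and pool them into
        K part-level prototypes (hypothetical formulation).

        feat:          (C, H, W) pixel embeddings of one frame
        mask:          (H, W)    soft foreground (or background) mask
        global_protos: (K, C)    global prototypes
        returns:       (K, C)    part-level prototypes
        """
        C, H, W = feat.shape
        pix = feat.view(C, -1).t()                       # (HW, C)
        sim = F.softmax(pix @ global_protos.t(), dim=1)  # (HW, K) soft part assignment
        w = sim * mask.view(-1, 1)                       # restrict to the masked region
        return (w.t() @ pix) / (w.sum(0, keepdim=True).t() + 1e-6)

    def propagate_prototypes(protos, weight):
        """One round of graph reasoning over prototypes (hypothetical GCN-style
        update): edges are prototype similarities, nodes exchange information.

        protos: (N, C) stacked part- and instance-level prototypes
        weight: (C, C) learnable projection
        """
        adj = F.softmax(protos @ protos.t() / protos.shape[1] ** 0.5, dim=1)  # (N, N)
        return F.relu(adj @ protos @ weight) + protos    # residual update

    def read_memory(query_feat, mem_feat, mem_val, mem_protos, query_protos):
        """Aggregate memory cues with the three affinities named in the abstract:
        pixel-to-pixel, prototype-to-pixel, and prototype-to-prototype.
        Shapes and the scaled dot-product form are illustrative assumptions.

        query_feat:   (C, H, W)  query-frame key features
        mem_feat:     (C, H, W)  memory key features
        mem_val:      (Cv, H, W) memory value features
        mem_protos:   (K, C)     prototypes stored in the memory bank
        query_protos: (K, C)     prototypes extracted from the query frame
        """
        C, H, W = query_feat.shape
        q = query_feat.view(C, -1).t()                   # (HW, C)
        m = mem_feat.view(C, -1).t()                     # (HW, C)
        v = mem_val.view(mem_val.shape[0], -1).t()       # (HW, Cv)

        # 1) pixel-to-pixel affinity: query pixels attend to memory pixels
        a_pp = F.softmax(q @ m.t() / C ** 0.5, dim=1)    # (HW, HW)
        pixel_read = a_pp @ v                            # (HW, Cv)

        # 2) prototype-to-pixel affinity: query pixels attend to memory prototypes
        a_qp = F.softmax(q @ mem_protos.t() / C ** 0.5, dim=1)  # (HW, K)
        proto_read = a_qp @ mem_protos                   # (HW, C)

        # 3) prototype-to-prototype affinity: align query and memory prototypes
        a_tt = F.softmax(query_protos @ mem_protos.t() / C ** 0.5, dim=1)  # (K, K)
        aligned_protos = a_tt @ mem_protos               # (K, C)

        return pixel_read, proto_read, aligned_protos

    # toy usage with random tensors (all sizes are arbitrary)
    C, Cv, K, H, W = 64, 128, 8, 24, 24
    feat, vals = torch.randn(C, H, W), torch.randn(Cv, H, W)
    global_p = torch.randn(K, C)
    mem_p = propagate_prototypes(part_prototypes(feat, torch.rand(H, W), global_p),
                                 torch.randn(C, C))
    qry_p = part_prototypes(torch.randn(C, H, W), torch.rand(H, W), global_p)
    pixel_read, proto_read, aligned = read_memory(torch.randn(C, H, W), feat, vals,
                                                  mem_p, qry_p)

In a full model, the three read-outs would be fused and passed to a decoder that predicts the query-frame mask; the sketch stops at the aggregation step that the abstract describes.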