Adaptive Selection of Reference Frames for Video Object Segmentation

IEEE Trans Image Process. 2022;31:1057-1071. doi: 10.1109/TIP.2021.3137660. Epub 2022 Jan 19.

Abstract

Video object segmentation is a challenging task in computer vision because the appearance of target objects may change drastically over time in a video. To address this problem, space-time memory (STM) networks exploit the information from all intermediate frames between the first frame and the current frame. However, using the information from all memory frames can make STM impractical for long videos. To overcome this issue, a novel method is developed in this paper to select reference frames adaptively. First, an adaptive selection criterion is introduced to choose reference frames with similar appearance and precise mask estimation, which efficiently captures the rich information of the target object and overcomes the challenges of appearance change, occlusion, and model drift. Second, bi-matching (bi-scale and bi-directional) is conducted to obtain more robust correlations for objects of various scales and to prevent multiple similar objects in the current frame from being matched to the same target object in the reference frame. Third, a novel edge refinement technique is designed that uses an edge detection network to obtain smooth edges from edge confidence maps, where the edge confidence is quantized into ten sub-intervals so that smooth edges are generated step by step. Experimental results on the challenging benchmark datasets DAVIS-2016, DAVIS-2017, YouTube-VOS, and a Long-Video dataset demonstrate the effectiveness of the proposed approach to video object segmentation.
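To make the first contribution concrete, below is a minimal sketch of how an adaptive reference-frame selection criterion of this kind could be implemented in PyTorch. It assumes cosine similarity between global frame features as the appearance measure and the decisiveness of the predicted foreground probabilities as a proxy for mask quality; the function name, scoring weights, and top-k cutoff are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def select_reference_frames(curr_feat, mem_feats, mem_mask_probs,
                            sim_weight=0.5, conf_weight=0.5, top_k=5):
    """Score each memory frame by (i) appearance similarity to the current
    frame and (ii) confidence of its estimated mask, then keep the top-k
    frames as references.  Names, weights, and the confidence proxy are
    hypothetical choices for illustration only."""
    scores = []
    for feat, probs in zip(mem_feats, mem_mask_probs):
        # Appearance similarity: cosine similarity of flattened features.
        sim = F.cosine_similarity(curr_feat.flatten(), feat.flatten(), dim=0)
        # Mask reliability: probabilities near 0 or 1 indicate a decisive
        # (likely precise) mask estimate; values near 0.5 are uncertain.
        conf = (2.0 * (probs - 0.5).abs()).mean()
        scores.append(sim_weight * sim + conf_weight * conf)
    scores = torch.stack(scores)
    keep = torch.topk(scores, k=min(top_k, scores.numel())).indices
    return sorted(keep.tolist())


# Example usage with dummy data: 20 memory frames, 256-d global features,
# and per-frame foreground probability maps.
curr_feat = torch.randn(256)
mem_feats = [torch.randn(256) for _ in range(20)]
mem_mask_probs = [torch.rand(480, 854) for _ in range(20)]
print(select_reference_frames(curr_feat, mem_feats, mem_mask_probs))
```

Keeping only the frames with both high appearance similarity and confident mask estimates bounds the memory size on long videos while retaining the references most likely to match the current appearance of the target.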