Joint Video Object Discovery and Segmentation by Coupled Dynamic Markov Networks

Ziyi Liu; Le Wang; Gang Hua; Qilin Zhang; Zhenxing Niu; Ying Wu; Nanning Zheng

doi:10.1109/TIP.2018.2859622

Joint Video Object Discovery and Segmentation by Coupled Dynamic Markov Networks

IEEE Trans Image Process. 2018 Dec;27(12):5840-5853. doi: 10.1109/TIP.2018.2859622. Epub 2018 Jul 30.

Authors

Ziyi Liu, Le Wang, Gang Hua, Qilin Zhang, Zhenxing Niu, Ying Wu, Nanning Zheng

PMID: 30059300
DOI: 10.1109/TIP.2018.2859622

Abstract

It is a challenging task to extract segmentation mask of a target from a single noisy video, which involves object discovery coupled with segmentation. To solve this challenge, we present a method to jointly discover and segment an object from a noisy video, where the target disappears intermittently throughout the video. Previous methods either only fulfill video object discovery, or video object segmentation presuming the existence of the object in each frame. We argue that jointly conducting the two tasks in a unified way will be beneficial. In other words, video object discovery and video object segmentation tasks can facilitate each other. To validate this hypothesis, we propose a principled probabilistic model, where two dynamic Markov networks are coupled-one for discovery and the other for segmentation. When conducting the Bayesian inference on this model using belief propagation, the bi-directional message passing reveals a clear collaboration between these two inference tasks. We validated our proposed method in five data sets. The first three video data sets, i.e., the SegTrack data set, the YouTube-objects data set, and the Davis data set, are not noisy, where all video frames contain the objects. The two noisy data sets, i.e., the XJTU-Stevens data set, and the Noisy-ViDiSeg data set, newly introduced in this paper, both have many frames that do not contain the objects. When compared with state of the art, it is shown that although our method produces inferior results on video data sets without noisy frames, we are able to obtain better results on video data sets with noisy frames.