Dense Pixel-Level Interpretation of Dynamic Scenes With Video Panoptic Segmentation

IEEE Trans Image Process. 2022;31:5383-5395. doi: 10.1109/TIP.2022.3183440. Epub 2022 Aug 17.

Abstract

A holistic understanding of dynamic scenes is of fundamental importance in real-world computer vision problems such as autonomous driving, augmented reality, and spatio-temporal reasoning. In this paper, we propose a new computer vision benchmark: Video Panoptic Segmentation (VPS). To study this important problem, we present two datasets, Cityscapes-VPS and VIPER, together with a new evaluation metric, video panoptic quality (VPQ). We also propose VPSNet++, an advanced video panoptic segmentation network, which simultaneously performs classification, detection, segmentation, and tracking of all identities in videos. Specifically, VPSNet++ builds upon a top-down panoptic segmentation network by adding a pixel-level feature fusion head and an object-level association head. The former temporally augments the pixel features, while the latter performs object tracking. Furthermore, we propose panoptic boundary learning as an auxiliary task, and instance discrimination learning, which learns spatio-temporally clustered pixel embeddings for individual thing or stuff regions, i.e., exactly the objective of the video panoptic segmentation problem. Our VPSNet++ significantly outperforms the default VPSNet, i.e., the FuseTrack baseline, and achieves state-of-the-art results on both the Cityscapes-VPS and VIPER datasets. The datasets, metric, and models are publicly available at https://github.com/mcahny/vps.
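The video panoptic quality (VPQ) metric mentioned above extends image-level panoptic quality (PQ) to segment "tubes" spanning a temporal window. A minimal sketch of the underlying PQ-style matching is given below; this is a simplified, single-class illustration (segments represented as sets of pixel indices, function names are my own), not the paper's actual evaluation code. In the full VPQ, each segment's pixel set would be the union of its masks over a window of frames.

```python
def iou(a, b):
    """Intersection-over-union of two pixel-index sets."""
    inter = len(a & b)
    union = len(a | b)
    return inter / union if union else 0.0

def panoptic_quality(pred_segments, gt_segments, thresh=0.5):
    """PQ = sum(IoU of matched pairs) / (TP + FP/2 + FN/2).

    pred_segments / gt_segments: lists of sets of pixel indices,
    all assumed to belong to one semantic class. With an IoU
    threshold above 0.5, each ground-truth segment can match at
    most one prediction, so a greedy scan suffices.
    """
    matched_pred, matched_gt = set(), set()
    iou_sum = 0.0
    for gi, g in enumerate(gt_segments):
        for pi, p in enumerate(pred_segments):
            if pi in matched_pred:
                continue
            v = iou(p, g)
            if v > thresh:  # unique match guaranteed for thresh >= 0.5
                iou_sum += v
                matched_pred.add(pi)
                matched_gt.add(gi)
                break
    tp = len(matched_gt)
    fp = len(pred_segments) - len(matched_pred)  # unmatched predictions
    fn = len(gt_segments) - tp                   # missed ground truths
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0
```

For example, a prediction covering 3 of 4 ground-truth pixels (IoU 0.75) plus one perfect match yields PQ = (0.75 + 1.0) / 2 = 0.875. VPQ averages such scores over classes and over several window lengths, so tracking errors that split a tube lower the matched IoU.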