Global and Local Interactive Perception Network for Referring Image Segmentation

IEEE Trans Neural Netw Learn Syst. 2023 Sep 11:PP. doi: 10.1109/TNNLS.2023.3308550. Online ahead of print.

Abstract

Effective modal fusion and perception between language and image are necessary for inferring the referred instance in the referring image segmentation (RIS) task. In this article, we propose a novel RIS network, the global and local interactive perception network (GLIPN), which enhances the quality of modal fusion between language and image from both local and global perspectives. The core of GLIPN is the global and local interactive perception (GLIP) scheme, which consists of a local perception module (LPM) and a global perception module (GPM). The LPM enhances local modal fusion by exploiting the correspondence between words and local image semantics. The GPM injects the global structured semantics of the image into the modal fusion process, which better guides the word embeddings to perceive the global structure of the whole image. Extensive experiments on several benchmark datasets demonstrate the advantage of the proposed GLIPN, with its local-global context semantics fusion, over most state-of-the-art approaches.
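
The abstract does not specify the internals of the LPM or the GPM, so the following is only a minimal illustrative sketch of the general idea of local and global language-image fusion, not the authors' implementation. It assumes that local fusion can be approximated by scaled dot-product attention from image locations to words, and that global fusion can be approximated by gating a mean-pooled image descriptor into the word embeddings. The function names `local_fusion` and `global_fusion`, the feature shapes, and the gating form are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_fusion(img, words):
    """Hypothetical stand-in for local word-to-region fusion.

    img:   (N, d) image features, one row per spatial location
    words: (T, d) word embeddings
    Each location attends over the words and adds the attended
    language feature to its own feature (residual connection).
    """
    d = img.shape[1]
    attn = softmax(img @ words.T / np.sqrt(d), axis=-1)  # (N, T)
    return img + attn @ words                            # (N, d)


def global_fusion(img, words):
    """Hypothetical stand-in for injecting global image semantics.

    A mean-pooled image descriptor is added to every word embedding,
    scaled by a per-word sigmoid gate (an assumption, not from the paper).
    """
    g = img.mean(axis=0, keepdims=True)            # (1, d) global descriptor
    gate = 1.0 / (1.0 + np.exp(-(words @ g.T)))    # (T, 1) gate per word
    return words + gate * g                        # (T, d)


rng = np.random.default_rng(0)
img = rng.standard_normal((16, 8))    # e.g. a 4x4 feature map, flattened
words = rng.standard_normal((5, 8))   # a 5-word referring expression

words_global = global_fusion(img, words)      # words see the whole image
fused_img = local_fusion(img, words_global)   # locations attend to words
print(fused_img.shape)  # (16, 8)
```

In this sketch the global step runs first so that the word embeddings already carry image-level context when each location attends to them; the actual ordering and interaction of the two modules in GLIPN may differ.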