Self-Enhanced Mixed Attention Network for Three-Modal Images Few-Shot Semantic Segmentation

Kechen Song; Yiming Zhang; Yanqi Bao; Ying Zhao; Yunhui Yan

doi:10.3390/s23146612

Sensors (Basel). 2023 Jul 22;23(14):6612. doi: 10.3390/s23146612.

Authors

Kechen Song¹, Yiming Zhang¹, Yanqi Bao², Ying Zhao¹, Yunhui Yan¹

Affiliations

¹ School of Mechanical Engineering & Automation, Northeastern University, Shenyang 110819, China.
² National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China.

Abstract

Image segmentation is an important computer vision technique that is widely used across tasks. In extreme cases, however, insufficient illumination severely degrades model performance, so a growing number of fully supervised methods take multi-modal images as input. Large, densely annotated datasets are difficult to obtain, whereas few-shot methods can still achieve satisfactory results with only a few pixel-annotated samples. We therefore propose a few-shot semantic segmentation method for Visible-Depth-Thermal (three-modal) images. It exploits both the homogeneous information shared by the three modalities and the complementary information between them to improve few-shot segmentation performance. We constructed a novel indoor dataset, VDT-2048-5ⁱ, for the three-modal few-shot semantic segmentation task. We also propose a Self-Enhanced Mixed Attention Network (SEMANet), which consists of a Self-Enhanced (SE) module and a Mixed Attention (MA) module. The SE module amplifies the differences between different kinds of features and strengthens the weak connections among foreground features. The MA module fuses the three-modal features into a better feature representation. Compared with the previous most advanced methods, our model improves mIoU by 3.8% and 3.3% in the 1-shot and 5-shot settings, respectively, achieving state-of-the-art performance. In future work, we will address failure cases by obtaining more discriminative and robust feature representations, and explore achieving high performance with fewer parameters and lower computational cost.
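
The abstract reports its gains in mIoU (mean intersection over union) but does not spell out the evaluation protocol. The following is a minimal sketch of one common few-shot segmentation convention, assumed here rather than taken from the paper: foreground intersection and union are accumulated per class over all test episodes, and the per-class IoU values are then averaged. The function name `mean_iou` and the episode tuple layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def mean_iou(episodes):
    """Class-averaged foreground IoU over few-shot test episodes.

    `episodes` is an iterable of (class_id, pred_mask, target_mask), where
    each mask is a 2-D nested list of 0/1 values of identical shape.
    This layout is an assumption for illustration, not the paper's format.
    """
    inter = defaultdict(int)  # per-class count of pixels in both masks
    union = defaultdict(int)  # per-class count of pixels in either mask

    for class_id, pred, target in episodes:
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter[class_id] += int(bool(p) and bool(t))
                union[class_id] += int(bool(p) or bool(t))

    # Skip classes that never appear in prediction or ground truth,
    # since their IoU is undefined (0 / 0).
    ious = [inter[c] / union[c] for c in union if union[c] > 0]
    if not ious:
        raise ValueError("no class with a non-empty union")
    return sum(ious) / len(ious)


episodes = [
    (0, [[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # class 0: IoU = 1/2
    (1, [[1, 0], [0, 0]], [[1, 0], [0, 0]]),  # class 1: IoU = 1/1
]
print(mean_iou(episodes))
```

Accumulating counts per class before dividing keeps a single small object from dominating the score, which is one reason this convention is often preferred over averaging per-image IoU values.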

Keywords: few-shot semantic segmentation; multi-modal images; three-modal registration.

Grants and funding

51805078/National Natural Science Foundation of China