A transformer-based approach empowered by a self-attention technique for semantic segmentation in remote sensing

Wadii Boulila; Hamza Ghandorh; Sharjeel Masood; Ayyub Alzahem; Anis Koubaa; Fawad Ahmed; Zahid Khan; Jawad Ahmad

doi:10.1016/j.heliyon.2024.e29396

Heliyon. 2024 Apr 12;10(8):e29396. doi: 10.1016/j.heliyon.2024.e29396. eCollection 2024 Apr 30.

Authors

Wadii Boulila¹ ², Hamza Ghandorh³, Sharjeel Masood⁴, Ayyub Alzahem¹, Anis Koubaa¹, Fawad Ahmed⁵, Zahid Khan¹, Jawad Ahmad⁶

Affiliations

¹ Robotics and Internet-of-Things Laboratory, Prince Sultan University, Riyadh 12435, Saudi Arabia.
² RIADI Laboratory, National School of Computer Science, University of Manouba, Manouba 2010, Tunisia.
³ College of Computer Science and Engineering, Taibah University, Medina 42353, Saudi Arabia.
⁴ Department of IT and Energy Convergence, Korea National University of Transportation, Chungju, South Korea.
⁵ Department of Cyber Security, Pakistan Navy Engineering College, NUST, Islamabad 75350, Pakistan.
⁶ School of Computing, Engineering and the Built Environment, Edinburgh Napier University, Edinburgh EH10 5DT, United Kingdom.

Abstract

Semantic segmentation of Remote Sensing (RS) images assigns every pixel of a satellite image to a class, partitioning the image into distinct, non-overlapping regions. This task is crucial in various domains, including land cover classification, autonomous driving, and scene understanding. While deep learning has shown promising results, little research specifically addresses the challenge of processing fine details in RS images while also accounting for the high computational demands. To tackle this issue, we propose a novel approach that combines convolutional and transformer architectures. Our design uses convolutional layers with a small receptive field to generate fine-grained feature maps for small objects in very high-resolution images, while transformer blocks capture contextual information from the input. Combining convolution and self-attention in this manner reduces the need for extensive downsampling and lets the network work with full-resolution features, which is particularly beneficial for handling small objects. Our approach also removes the need for the very large datasets that purely transformer-based networks often require. Our experimental results demonstrate that the method generates local features with the convolutional layers and contextual features with the transformer layers. Our approach achieves a mean Dice score of 80.41%, outperforming other well-known techniques such as UNet, Fully Convolutional Network (FCN), Pyramid Scene Parsing Network (PSPNet), and the recent Convolutional vision Transformer (CvT) model, which achieved mean Dice scores of 78.57%, 74.57%, 73.45%, and 62.97%, respectively, under the same training conditions and using the same training dataset.
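For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a mean Dice score can be computed from integer label maps. The abstract does not state how the per-class scores were averaged, so the macro-average over classes used here, the choice to skip classes absent from both maps, and the function names `dice_per_class` and `mean_dice` are all assumptions made for illustration, not the authors' evaluation code.

```python
import numpy as np

def dice_per_class(pred, target, num_classes):
    """Per-class Dice: 2*|P ∩ T| / (|P| + |T|) on integer label maps."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    scores = np.full(num_classes, np.nan)
    for c in range(num_classes):
        p = pred == c      # pixels predicted as class c
        t = target == c    # pixels labelled as class c
        denom = p.sum() + t.sum()
        if denom == 0:
            # Assumption: a class absent from both maps is left as NaN
            # and excluded from the mean rather than scored as 0 or 1.
            continue
        scores[c] = 2.0 * np.logical_and(p, t).sum() / denom
    return scores

def mean_dice(pred, target, num_classes):
    """Macro-average of the per-class Dice scores, ignoring NaN entries."""
    return float(np.nanmean(dice_per_class(pred, target, num_classes)))

# Toy 2x3 label maps with three classes (illustrative data only).
pred = np.array([[0, 0, 1],
                 [1, 2, 2]])
target = np.array([[0, 1, 1],
                   [1, 2, 2]])
print(dice_per_class(pred, target, 3))
print(mean_dice(pred, target, 3))
```

In the toy example the per-class scores are 2/3, 4/5, and 1, so the macro-average is about 0.822. A reported figure such as 80.41% would correspond to this kind of value expressed as a percentage, under the averaging assumption stated above.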

Keywords: Remote sensing; Satellite images; Self-attention; Semantic segmentation; Vision transformer.