Satellite Video Multi-Label Scene Classification With Spatial and Temporal Feature Cooperative Encoding: A Benchmark Dataset and Method

IEEE Trans Image Process. 2024;33:2238-2251. doi: 10.1109/TIP.2024.3374100. Epub 2024 Mar 21.

Abstract

Satellite video multi-label scene classification predicts the semantic labels of multiple ground contents to describe a given satellite observation video, which plays an important role in applications such as ocean observation and smart cities. However, the lack of a high-quality, large-scale dataset has prevented further progress on the task, and existing methods designed for general videos struggle to represent the local details of ground contents when applied directly to satellite videos. In this paper, our contributions are twofold. (1) We develop the first publicly available, large-scale satellite video multi-label scene classification dataset, consisting of 18 classes of static and dynamic ground contents, 3,549 videos, and 141,960 frames. (2) We propose a baseline method with a novel Spatial and Temporal Feature Cooperative Encoding (STFCE). It exploits the relations between local spatial and temporal features and models the long-term motion information hidden in inter-frame variations. In this way, it enhances the features of local details and obtains a powerful video-scene-level feature representation, which effectively improves classification performance. Experimental results show that our proposed STFCE outperforms 13 state-of-the-art methods, achieving a global average precision (GAP) of 0.8106, and that careful fusion and joint learning of spatial, temporal, and motion features yield a more robust and accurate model. Moreover, benchmarking results show that the proposed dataset is highly challenging, and we hope it will promote further development of the satellite video multi-label scene classification task.
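The abstract does not specify STFCE's internal architecture, so the following is a minimal, hypothetical sketch (not the authors' implementation) of one way to jointly fuse and learn spatial, temporal, and motion branch features for multi-label scene prediction; all module names, dimensions, and the late-fusion design are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Hypothetical late-fusion head: concatenates pooled video-level features
    from three branches and predicts 18 multi-label scene classes."""
    def __init__(self, dim: int = 512, num_classes: int = 18):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(3 * dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, num_classes),
        )

    def forward(self, spatial, temporal, motion):
        # Each input: (batch, dim) pooled feature from one branch
        return self.fuse(torch.cat([spatial, temporal, motion], dim=1))

head = FusionHead()
s = torch.randn(4, 512)   # dummy spatial branch features
t = torch.randn(4, 512)   # dummy temporal branch features
m = torch.randn(4, 512)   # dummy motion branch features
logits = head(s, t, m)    # (4, 18) per-class logits
# Multi-label training uses an independent sigmoid per class
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (4, 18)).float())
loss.backward()           # gradients flow jointly through all three branches
```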
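For readers unfamiliar with the reported metric, the sketch below computes global average precision (GAP) as commonly defined for multi-label video classification (e.g., the YouTube-8M benchmark): the top-k predictions of every video are pooled, ranked globally by confidence, and averaged precision is taken over that ranking. Whether the paper uses this exact top-k pooling (k = 20 here) is an assumption.

```python
import numpy as np

def gap(scores: np.ndarray, labels: np.ndarray, top_k: int = 20) -> float:
    """scores: (num_videos, num_classes) predicted confidences.
    labels: (num_videos, num_classes) binary ground-truth matrix."""
    confidences, hits = [], []
    for s, y in zip(scores, labels):
        top = np.argsort(s)[::-1][:top_k]   # top-k classes for this video
        confidences.extend(s[top])
        hits.extend(y[top])
    order = np.argsort(confidences)[::-1]   # global ranking by confidence
    hits = np.asarray(hits, dtype=float)[order]
    precision_at_i = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    total_positives = max(labels.sum(), 1.0)  # guard against divide-by-zero
    return float((precision_at_i * hits).sum() / total_positives)

# Toy usage: 2 videos, 3 classes (e.g., a subset of the 18 ground contents)
scores = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.3]])
labels = np.array([[1, 0, 1], [0, 1, 0]])
print(f"GAP = {gap(scores, labels):.4f}")   # 1.0000: a perfect global ranking
```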