QueryTrack: Joint-Modality Query Fusion Network for RGBT Tracking

IEEE Trans Image Process. 2024:33:3187-3199. doi: 10.1109/TIP.2024.3393298. Epub 2024 May 6.

Abstract

Existing RGB-Thermal trackers usually treat intra-modal feature extraction and inter-modal feature fusion as two separate processes, therefore the mutual promotion of extraction and fusion is neglected. Then, the complementary advantages of RGB-T fusion are not fully exploited, and the independent feature extraction is not adaptive to modal quality fluctuation during tracking. To address the limitations, we design a joint-modality query fusion network, in which the intra-modal feature extraction and the inter-modal fusion are coupled together and promote each other via joint-modality queries. The queries are initialized based on the multimodal features of the current frame, making the subsequent fusion adaptive to modal quality fluctuation during tracking. Then the joint-modality query fusion (JQF) utilizes the queries to interact with RGB-T features, allowing the intra-modal enhancement and the inter-modal interactions to be unified for mutual promotion. In this way, JQF can distinguish and enhance the complementary modality features, while filtering out redundant information. For real-time tracking, we propose regional cross-attention for cross-modal interactions to reduce computational cost. Our end-to-end tracker sets a new state-of-the-art performance on multiple RGBT tracking benchmarks including LasHeR, VTUAV, RGBT234 and GTOT, while running at a real-time speed.