Rethinking Attentive Object Detection via Neural Attention Learning

IEEE Trans Image Process. 2024;33:1726-1739. doi: 10.1109/TIP.2023.3251693. Epub 2024 Mar 7.

Abstract

Visual attention advances object detection by directing neural networks toward object representations. While existing methods rely on empirically designed modules to strengthen network attention, in this work we rethink attentive object detection from the perspective of network learning. We propose a NEural Attention Learning approach (NEAL), which consists of two parts. First, during the back-propagation of each training iteration, we calculate the partial derivatives (a.k.a. the accumulated gradients) of the classification output with respect to the input features, and refine them to obtain attention response maps whose elements reflect their contributions to the final network predictions. Second, we formulate the attention response maps as extra objective functions, which are combined with the original detection loss to train detectors in an end-to-end manner. In this way, we learn an attentive CNN model without introducing additional network structures. We apply NEAL to two-stage object detection frameworks, which are usually composed of a CNN feature backbone, a region proposal network (RPN), and a classifier. We show that NEAL not only helps the RPN attend to objects but also enables the classifier to pay more attention to the premier positive samples. As a result, localization (proposal generation) and classification mutually benefit from each other in our method. Extensive experiments on large-scale benchmark datasets, including MS COCO 2017 and Pascal VOC 2012, demonstrate that NEAL improves two-stage object detectors beyond state-of-the-art approaches.
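The two parts described above can be illustrated with a toy sketch in plain Python. It is not the authors' implementation, and the abstract does not specify the details it fills in. The classifier (global average pooling followed by a linear layer), the refinement step (gradient times activation, summed over channels, rectified and rescaled), the binary cross-entropy attention loss, the object mask, and the weight `lam` are all assumptions chosen only to make the idea concrete.

```python
import math

def class_score(features, weights):
    """Toy classifier (assumed): average-pool each channel, then a linear layer."""
    score = 0.0
    for channel, w in zip(features, weights):
        n = sum(len(row) for row in channel)
        score += w * sum(sum(row) for row in channel) / n
    return score

def score_gradient(features, weights):
    """Analytic d(score)/d(feature) for the toy classifier: w_c / (H * W)."""
    grads = []
    for channel, w in zip(features, weights):
        n = sum(len(row) for row in channel)
        grads.append([[w / n for _ in row] for row in channel])
    return grads

def attention_response(features, grads):
    """Assumed refinement: sum gradient * activation over channels,
    rectify, then rescale so the largest response is 1."""
    h, w = len(features[0]), len(features[0][0])
    raw = [[max(0.0, sum(g[i][j] * f[i][j] for f, g in zip(features, grads)))
            for j in range(w)] for i in range(h)]
    peak = max(max(row) for row in raw)
    return [[v / peak if peak > 0 else 0.0 for v in row] for row in raw]

def attention_loss(attention, object_mask, eps=1e-6):
    """Assumed extra objective: mean binary cross-entropy between the
    attention response map and a hypothetical ground-truth object mask."""
    total, count = 0.0, 0
    for a_row, m_row in zip(attention, object_mask):
        for a, m in zip(a_row, m_row):
            a = min(max(a, eps), 1.0 - eps)  # clamp to keep log() finite
            total -= m * math.log(a) + (1 - m) * math.log(1 - a)
            count += 1
    return total / count

def total_loss(detection_loss, attention, object_mask, lam=0.5):
    """Combine the original detection loss with the attention objective."""
    return detection_loss + lam * attention_loss(attention, object_mask)

# Two 2x2 feature channels; the object occupies the top-left cell.
features = [[[1.0, 0.0], [0.0, 0.0]],
            [[0.0, 2.0], [0.0, 0.0]]]
weights = [2.0, -1.0]
object_mask = [[1, 0], [0, 0]]

grads = score_gradient(features, weights)
attention = attention_response(features, grads)
loss = total_loss(1.0, attention, object_mask)
```

In an actual two-stage detector, the gradients would come from automatic differentiation during back-propagation rather than a hand-derived formula, and the attention objective would be computed for both the RPN and the classifier.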