Sparse R-CNN: An End-to-End Framework for Object Detection

IEEE Trans Pattern Anal Mach Intell. 2023 Dec;45(12):15650-15664. doi: 10.1109/TPAMI.2023.3292030. Epub 2023 Nov 3.

Abstract

Object detection serves as one of most fundamental computer vision tasks. Existing works on object detection heavily rely on dense object candidates, such as k anchor boxes pre-defined on all grids of an image feature map of size H×W. In this paper, we present Sparse R-CNN, a very simple and sparse method for object detection in images. In our method, a fixed sparse set of learned object proposals ( N in total) are provided to the object recognition head to perform classification and localization. By replacing HWk (up to hundreds of thousands) hand-designed object candidates with N (e.g., 100) learnable proposals, Sparse R-CNN makes all efforts related to object candidates design and one-to-many label assignment completely obsolete. More importantly, Sparse R-CNN directly outputs predictions without the non-maximum suppression (NMS) post-processing procedure. Thus, it establishes an end-to-end object detection framework. Sparse R-CNN demonstrates highly competitive accuracy, run-time and training convergence performance with the well-established detector baselines on the challenging COCO dataset and CrowdHuman dataset. We hope that our work can inspire re-thinking the convention of dense prior in object detectors and designing new high-performance detectors.