Tasks Integrated Networks: Joint Detection and Retrieval for Image Search

Lei Zhang; Zhenwei He; Yi Yang; Liang Wang; Xinbo Gao

doi:10.1109/TPAMI.2020.3009758

Tasks Integrated Networks: Joint Detection and Retrieval for Image Search

IEEE Trans Pattern Anal Mach Intell. 2022 Jan;44(1):456-473. doi: 10.1109/TPAMI.2020.3009758. Epub 2021 Dec 7.

Authors

Lei Zhang, Zhenwei He, Yi Yang, Liang Wang, Xinbo Gao

PMID: 32750818
DOI: 10.1109/TPAMI.2020.3009758

Abstract

The traditional object (person) retrieval (re-identification) task aims to learn a discriminative feature representation with intra-similarity and inter-dissimilarity, which supposes that the objects in an image are manually or automatically pre-cropped exactly. However, in many real-world searching scenarios (e.g., video surveillance), the objects (e.g., persons, vehicles, etc.) are seldom accurately detected or annotated. Therefore, object-level retrieval becomes intractable without bounding-box annotation, which leads to a new but challenging topic, i.e., image-level search with multi-task integration of joint detection and retrieval. In this paper, to address the image search issue, we first introduce an end-to-end Integrated Net (I-Net), which has three merits: 1) A Siamese architecture and an on-line pairing strategy for similar and dissimilar objects in the given images are designed. Benefited by the Siamese structure, I-Net learns the shared feature representation, because, on which, both object detection and classification tasks are handled. 2) A novel on-line pairing (OLP) loss is introduced with a dynamic feature dictionary, which alleviates the multi-task training stagnation problem, by automatically generating a number of negative pairs to restrict the positives. 3) A hard example priority (HEP) based softmax loss is proposed to improve the robustness of classification task by selecting hard categories. The shared feature representation of I-Net may restrict the task-specific flexibility and learning capability between detection and retrieval tasks. Therefore, with the philosophy of divide and conquer, we further propose an improved I-Net, called DC-I-Net, which makes two new contributions: 1) two modules are tailored to handle different tasks separately in the integrated framework, such that the task specification is guaranteed. 2) A class-center guided HEP loss (C ²HEP) by exploiting the stored class centers is proposed, such that the intra-similarity and inter-dissimilarity can be captured for ultimate retrieval. Extensive experiments on famous image-level search oriented benchmark datasets, such as CUHK-SYSU dataset and PRW dataset for person search and the large-scale WebTattoo dataset for tattoo search, demonstrate that the proposed DC-I-Net outperforms the state-of-the-art tasks-integrated and tasks-separated image search models.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms*
Humans