KE-RCNN: Unifying Knowledge-Based Reasoning Into Part-Level Attribute Parsing

IEEE Trans Cybern. 2023 Nov;53(11):7263-7274. doi: 10.1109/TCYB.2022.3209653. Epub 2022 Oct 17.

Abstract

Part-level attribute parsing is a fundamental but challenging task that requires region-level visual understanding to provide explainable details of body parts. Most existing approaches address this problem by adding a region-based convolutional neural network (RCNN) with an attribute prediction head to a two-stage detector, in which attributes of body parts are identified from local part boxes. However, local part boxes with limited visual cues (i.e., part appearance only) lead to unsatisfactory parsing results, since attributes of body parts are highly dependent on comprehensive relations among them. In this article, we propose a knowledge-embedded RCNN (KE-RCNN) that identifies attributes by leveraging rich knowledge, including implicit knowledge (e.g., the attribute "above-the-hip" for a shirt requires visual/geometric relations between shirt and hip) and explicit knowledge (e.g., the part "shorts" cannot have the attribute "hoodie" or "lining"). Specifically, the KE-RCNN consists of two novel components: 1) an implicit knowledge-based encoder (IK-En) and 2) an explicit knowledge-based decoder (EK-De). The former enhances part-level representation by encoding part-part relational contexts into part boxes, and the latter decodes attributes under the guidance of prior knowledge about part-attribute relations. In this way, the KE-RCNN is plug-and-play and can be integrated into any two-stage detector, for example, Attribute-RCNN, Cascade-RCNN, HRNet-based RCNN, and SwinTransformer-based RCNN. Extensive experiments conducted on two challenging benchmarks, Fashionpedia and Kinetics-TPS, demonstrate the effectiveness and generalizability of the KE-RCNN. In particular, it achieves consistent improvements over all existing methods, reaching gains of around 3% AP_all^{IoU+F1} on Fashionpedia and around 4% Acc_p on Kinetics-TPS. Code and models are publicly available at: https://github.com/sota-joson/KE-RCNN.
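To make the two components concrete, the following is a minimal PyTorch-style sketch of how such an attribute head could be wired onto pooled part-box features. The class names (IKEncoder, EKDecoder), the self-attention form of the relational encoding, and the mask-based use of a part-attribute feasibility prior are illustrative assumptions, not the authors' implementation; the official code is in the linked repository.

```python
import torch
import torch.nn as nn


class IKEncoder(nn.Module):
    """Implicit knowledge-based encoder (IK-En) sketch: enriches each
    part-box feature with part-part relational context via self-attention
    over all part features of the same image."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, part_feats: torch.Tensor) -> torch.Tensor:
        # part_feats: (B, P, D) pooled RoI features for P part boxes.
        ctx, _ = self.attn(part_feats, part_feats, part_feats)
        return self.norm(part_feats + ctx)  # relation-enhanced features


class EKDecoder(nn.Module):
    """Explicit knowledge-based decoder (EK-De) sketch: predicts attribute
    logits and suppresses attributes that a part-attribute prior marks as
    infeasible for the part category (e.g., "shorts" cannot be "hoodie")."""

    def __init__(self, dim: int, num_attrs: int, part_attr_prior: torch.Tensor):
        super().__init__()
        # part_attr_prior: (num_parts, num_attrs) binary matrix; 1 where an
        # attribute is feasible for a part. Assumed built from dataset metadata.
        self.cls = nn.Linear(dim, num_attrs)
        self.register_buffer("prior", part_attr_prior)

    def forward(self, part_feats: torch.Tensor,
                part_labels: torch.Tensor) -> torch.Tensor:
        logits = self.cls(part_feats)   # (B, P, num_attrs)
        mask = self.prior[part_labels]  # (B, P, num_attrs), gathered per part
        return logits.masked_fill(mask == 0, float("-inf"))


# Toy usage: 2 images, 4 part boxes each, 256-d features, 10 part
# categories, 32 attributes (Fashionpedia itself defines 294 attributes).
enc = IKEncoder(dim=256)
dec = EKDecoder(dim=256, num_attrs=32,
                part_attr_prior=torch.randint(0, 2, (10, 32)))
feats = torch.randn(2, 4, 256)
labels = torch.randint(0, 10, (2, 4))
attr_logits = dec(enc(feats), labels)  # (2, 4, 32)
```

Under this sketch, masking infeasible attributes to -inf drives their sigmoid scores to zero in multi-label prediction, which is one simple way to inject the explicit part-attribute prior at inference time.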