From Simple to Complex Scenes: Learning Robust Feature Representations for Accurate Human Parsing

IEEE Trans Pattern Anal Mach Intell. 2024 Feb 16:PP. doi: 10.1109/TPAMI.2024.3366769. Online ahead of print.

Abstract

Human parsing has attracted considerable research interest due to its broad range of potential applications in the computer vision community. In this paper, we explore several useful properties, including high-resolution representation, auxiliary guidance, and model robustness, which collectively contribute to a novel method for accurate human parsing in both simple and complex scenes. Starting from simple scenes, we propose the boundary-aware hybrid resolution network (BHRN), an advanced human parsing network. BHRN utilizes deconvolutional layers and multi-scale supervision to generate rich high-resolution representations, and includes an edge-perceiving branch that refines part boundaries. Building on BHRN, we construct a dual-task mutual learning (DTML) framework, which not only provides implicit guidance to the parser by incorporating boundary features, but also explicitly maintains high-order consistency between the parsing prediction and the ground truth. Turning to complex scenes, we develop a domain transform method to enhance model robustness. By transforming the input from the spatial domain to the polar harmonic Fourier moment domain, we obtain a highly stable mapping to the output semantic space, which yields robust representations for both clean and corrupted data. On standard benchmark datasets, our method outperforms state-of-the-art human parsing methods. Furthermore, our domain transform strategy significantly improves the robustness of DTML in most complex scenes.
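
To make the multi-scale supervision idea concrete, here is a minimal PyTorch sketch; it is not the paper's BHRN architecture, and the module name, channel widths, and number of deconvolution stages are illustrative assumptions. Each deconvolutional stage doubles the spatial resolution and feeds its own parsing head, so a loss can be attached at every scale during training.

import torch
import torch.nn as nn

class HybridResolutionHead(nn.Module):
    """Illustrative decoder (not the paper's BHRN): two deconvolution
    stages, each followed by its own parsing head so that the network
    can be supervised at multiple scales."""

    def __init__(self, in_ch=256, num_classes=20):
        super().__init__()
        self.up1 = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 128, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True))
        self.up2 = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.head_lo = nn.Conv2d(128, num_classes, kernel_size=1)  # coarse prediction
        self.head_hi = nn.Conv2d(64, num_classes, kernel_size=1)   # fine prediction

    def forward(self, feats):
        x1 = self.up1(feats)   # 2x upsampling via deconvolution
        x2 = self.up2(x1)      # 4x total upsampling
        return self.head_lo(x1), self.head_hi(x2)

# Example: both outputs would receive a segmentation loss during training.
# lo, hi = HybridResolutionHead()(torch.randn(1, 256, 32, 32))

Supervising the intermediate head alongside the final one lets gradients shape the coarse features directly, which is the usual motivation for multi-scale supervision.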
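
The domain transform can be sketched in the same spirit. Polar harmonic moments project an image onto orthogonal basis functions defined on the unit disc; since the abstract does not give the exact radial basis of the polar harmonic Fourier moments, the sketch below substitutes the polar complex exponential kernel exp(j*2*pi*n*r^2) from the same polar harmonic family, so it illustrates the general transform rather than the paper's exact formulation.

import numpy as np

def polar_harmonic_moments(img, n_max=4, m_max=4):
    """Project a square grayscale image onto basis functions
    H_n(r) * exp(j*m*theta), with H_n(r) = exp(j*2*pi*n*r^2),
    over the unit disc inscribed in the image."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Map pixel centres into [-1, 1] x [-1, 1] around the image centre.
    x = (2 * xs - w + 1) / w
    y = (2 * ys - h + 1) / h
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    f = img.astype(np.float64) * (r <= 1.0)   # keep only the unit disc
    area = 4.0 / (h * w)                      # pixel area in normalised coords
    moments = np.zeros((2 * n_max + 1, 2 * m_max + 1), dtype=np.complex128)
    for n in range(-n_max, n_max + 1):
        radial = np.exp(1j * 2 * np.pi * n * r ** 2)   # radial kernel H_n(r)
        for m in range(-m_max, m_max + 1):
            basis = radial * np.exp(1j * m * theta)
            # Inner product <f, basis>, approximated by a Riemann sum;
            # r dr dtheta = dx dy, so the Cartesian sum needs no extra Jacobian.
            moments[n + n_max, m + m_max] = np.sum(f * np.conj(basis)) * area / np.pi
    return moments

A low-order moment vector of this kind summarises the image with smooth global basis functions, so high-frequency corruptions perturb it far less than they perturb raw pixels, and rotating the image only multiplies each moment by a unit-modulus phase factor. These properties are the usual reasons such moment domains give the stable input-to-output mapping the abstract describes.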