SSAT++: A Semantic-Aware and Versatile Makeup Transfer Network With Local Color Consistency Constraint

IEEE Trans Neural Netw Learn Syst. 2023 Nov 24:PP. doi: 10.1109/TNNLS.2023.3332065. Online ahead of print.

Abstract

The purpose of makeup transfer (MT) is to transfer makeup from a reference image to a target face while preserving the target's content. Existing methods have made remarkable progress in generating realistic results but do not perform well in terms of semantic correspondence and color fidelity. In addition, the straightforward extension to video, processing frames independently, tends to produce flickering results in most methods. These limitations restrict the applicability of previous methods in real-world scenarios. To address these issues, we propose a symmetric semantic-aware transfer network (SSAT++) to improve makeup similarity and video temporal consistency. For MT, the feature fusion (FF) module first integrates the content and semantic features of the input images, producing multiscale fusion features. Then, the semantic correspondence from the reference to the target is obtained by measuring the correlation of the fusion features at each position. According to this semantic correspondence, the symmetric mask semantic transfer (SMST) module aligns the reference makeup features with the target content features to generate the MT result. Meanwhile, the semantic correspondence from the target to the reference is obtained by transposing the correlation matrix and is applied to the makeup removal task. To enhance color fidelity, we propose a novel local color loss that forces the transferred result to have the same color histogram distribution as the reference. Furthermore, a morphing simulation is designed to ensure temporal consistency for video MT without requiring additional video frames as input or optical flow estimation. To evaluate the effectiveness of SSAT++, extensive experiments have been conducted on the MT dataset, which covers a variety of makeup styles, and on the MT-Wild dataset, which contains images with diverse poses and expressions. The experiments show that SSAT++ outperforms existing MT methods in both qualitative and quantitative evaluations and provides more flexible makeup control. Code and trained models will be available at https://gitee.com/sunzhaoyang0304/ssat-msp and https://github.com/Snowfallingplum/SSAT.
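
The correspondence and transfer steps described in the abstract can be summarized in code. The following PyTorch sketch is a minimal illustration, not the authors' implementation: the tensor names (fuse_t, fuse_r, makeup_r, makeup_t), the shapes, and the softmax temperature tau are our assumptions. It measures the correlation between normalized fusion features, warps the reference makeup features to the target, and reuses the transposed correlation matrix for the removal direction.

import torch
import torch.nn.functional as F

def semantic_transfer(fuse_t, fuse_r, makeup_r, makeup_t, tau=100.0):
    """Warp makeup features along the cross-image semantic correspondence.
    All tensor names and the (B, C, H, W) shapes are illustrative assumptions."""
    B, C, H, W = fuse_t.shape
    t = F.normalize(fuse_t.flatten(2), dim=1)          # (B, C, H*W), unit channel vectors
    r = F.normalize(fuse_r.flatten(2), dim=1)          # (B, C, H*W)
    corr = torch.bmm(t.transpose(1, 2), r)             # (B, HW_t, HW_r) cosine correlation
    # Reference -> target correspondence: attach reference makeup to the target.
    attn_r2t = F.softmax(tau * corr, dim=-1)
    warp_r2t = torch.bmm(attn_r2t, makeup_r.flatten(2).transpose(1, 2))
    warp_r2t = warp_r2t.transpose(1, 2).view(B, C, H, W)
    # Transposing the correlation gives target -> reference correspondence,
    # which is reused for the makeup removal branch.
    attn_t2r = F.softmax(tau * corr.transpose(1, 2), dim=-1)
    warp_t2r = torch.bmm(attn_t2r, makeup_t.flatten(2).transpose(1, 2))
    warp_t2r = warp_t2r.transpose(1, 2).view(B, C, H, W)
    return warp_r2t, warp_t2r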
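The local color loss can be illustrated with a differentiable soft histogram. The sketch below is an assumption-laden approximation rather than the paper's exact formulation: the function names, bin count, and Gaussian kernel width are ours. It builds per-channel color histograms inside a semantic region mask (e.g., lips or skin) for both the transferred result and the reference, then penalizes their L1 difference.

import torch

def soft_histogram(x, bins=32, sigma=0.02):
    """Differentiable soft histogram of values in [0, 1]; an illustrative
    Gaussian-kernel approximation, not the paper's exact formulation."""
    centers = torch.linspace(0.0, 1.0, bins, device=x.device)   # (bins,)
    x = x.reshape(-1, 1)                                        # (N, 1)
    weights = torch.exp(-0.5 * ((x - centers) / sigma) ** 2)    # (N, bins)
    hist = weights.sum(dim=0)
    return hist / (hist.sum() + 1e-8)                           # normalize to a distribution

def local_color_loss(result, reference, mask_result, mask_ref, bins=32):
    """Match per-region, per-channel color histograms between the MT result
    and the reference; masks select one semantic region in each image."""
    loss = 0.0
    for c in range(result.shape[1]):                  # over RGB channels
        h_res = soft_histogram(result[:, c][mask_result.bool()], bins)
        h_ref = soft_histogram(reference[:, c][mask_ref.bool()], bins)
        loss = loss + (h_res - h_ref).abs().sum()     # L1 between histograms
    return loss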
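Finally, the morphing simulation for video temporal consistency can be approximated as follows. This is a hypothetical sketch: a small random affine warp stands in for the paper's morphing, generating a pseudo "next frame" from a single target image, and the loss penalizes the difference between transferring the warped frame and warping the transferred frame, so no real video frames or optical flow are needed.

import torch
import torch.nn.functional as F

def morphing_consistency_loss(model, target, reference, max_shift=0.05):
    """Hypothetical sketch: model(target, reference) is assumed to return the
    MT result; the affine jitter is our stand-in for the paper's morphing."""
    B = target.size(0)
    # Small random translation around the identity transform.
    theta = torch.eye(2, 3, device=target.device).repeat(B, 1, 1)
    theta[:, :, 2] += (torch.rand(B, 2, device=target.device) - 0.5) * 2 * max_shift
    grid = F.affine_grid(theta, target.size(), align_corners=False)
    warped_target = F.grid_sample(target, grid, align_corners=False)

    out = model(target, reference)                    # MT result for the original frame
    out_warped_in = model(warped_target, reference)   # MT result for the simulated next frame
    warped_out = F.grid_sample(out, grid, align_corners=False)
    return F.l1_loss(out_warped_in, warped_out)       # outputs should move with the face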