Improving Transferability of Universal Adversarial Perturbation With Feature Disruption

IEEE Trans Image Process. 2024;33:722-737. doi: 10.1109/TIP.2023.3345136. Epub 2024 Jan 12.

Abstract

Deep neural networks (DNNs) have been shown to be vulnerable to universal adversarial perturbations (UAPs): a single quasi-imperceptible perturbation that deceives a DNN on most input images. Existing UAP methods can be divided into data-dependent and data-independent approaches. The former exhibit weak transferability to black-box models because they rely heavily on model-specific features, while the latter show inferior attack performance on white-box models because they fail to exploit the model's responses to benign images. To address these issues, this paper proposes a novel universal adversarial attack that generates UAPs with strong transferability by disrupting model-agnostic features (e.g., edges or simple textures), which are invariant across models. Specifically, we first devise an objective function that weakens the significant channel-wise features and strengthens the less significant ones, where the two groups are partitioned by a designed strategy. The proposed objective function also eliminates the dependence on labeled samples, allowing us to use out-of-distribution (OOD) data to train the UAP. To enhance the attack performance with limited training samples, we exploit the average gradient of each mini-batch to update the UAP iteratively, which encourages the UAP to capture the local information within the mini-batch. In addition, we introduce a momentum term that accumulates the gradient information across iterations to capture global information over the training set. Extensive experimental results demonstrate that the proposed method outperforms existing UAP approaches. We also exhaustively investigate the transferability of the UAP across models, datasets, and tasks.
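The abstract names three algorithmic ingredients: a label-free channel-wise feature-disruption objective, an update driven by the mini-batch average gradient, and a momentum term that accumulates gradients across iterations. The following is a minimal PyTorch sketch of how these pieces could fit together; it is not the authors' released code. The significance criterion (mean absolute channel activation), the hooked layer, the channel split ratio, the perturbation budget, and the step sizes are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a feature-disruption UAP trainer; details are assumptions.
import torch


def feature_disruption_loss(features, topk_ratio=0.5):
    """Push down the most significant channels and push up the rest.

    `features`: (B, C, H, W) activations from an intermediate layer.
    Channels are ranked by mean absolute activation (assumed criterion).
    """
    strength = features.abs().mean(dim=(2, 3))              # (B, C) channel significance
    k = max(1, int(topk_ratio * strength.size(1)))
    top_idx = strength.topk(k, dim=1).indices                # "significant" channels
    mask = torch.zeros_like(strength).scatter_(1, top_idx, 1.0)
    significant = (strength * mask).sum(dim=1)
    less_significant = (strength * (1.0 - mask)).sum(dim=1)
    # Minimizing this weakens significant channels and strengthens the others;
    # no labels are required, so OOD images can serve as training data.
    return (significant - less_significant).mean()


def train_uap(model, layer, loader, eps=10 / 255, steps=1000, mu=0.9, lr=0.01, device="cpu"):
    """Train a single universal perturbation with momentum over mini-batch gradients."""
    model.eval().to(device)
    feats = {}
    layer.register_forward_hook(lambda m, i, o: feats.update(out=o))

    uap = torch.zeros(1, 3, 224, 224, device=device)         # universal perturbation
    momentum = torch.zeros_like(uap)
    data = iter(loader)
    for _ in range(steps):
        try:
            x, _ = next(data)                                 # labels are ignored
        except StopIteration:
            data = iter(loader)
            x, _ = next(data)
        x = x.to(device)
        delta = uap.clone().requires_grad_(True)
        model(torch.clamp(x + delta, 0.0, 1.0))
        loss = feature_disruption_loss(feats["out"])
        # Gradient w.r.t. the shared perturbation aggregates over the mini-batch
        # (delta is broadcast across the batch dimension).
        grad = torch.autograd.grad(loss, delta)[0] / x.size(0)
        # Momentum accumulates gradient information across iterations so the
        # update also reflects earlier batches, not just the current one.
        momentum = mu * momentum + grad / (grad.abs().mean() + 1e-12)
        uap = torch.clamp(uap - lr * momentum.sign(), -eps, eps)
    return uap
```

With a torchvision classifier, `layer` could be an intermediate block such as `model.layer2` of a ResNet, and `loader` any unlabeled image loader, including OOD data, reflecting the label-free objective described above; these choices are hypothetical examples rather than the paper's reported configuration.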