Weight Decay With Tailored Adam on Scale-Invariant Weights for Better Generalization

IEEE Trans Neural Netw Learn Syst. 2022 Oct 24:PP. doi: 10.1109/TNNLS.2022.3213536. Online ahead of print.

Abstract

Weight decay (WD) is a fundamental and practical regularization technique for improving the generalization of current deep learning models. However, it is observed that WD does not work as effectively for adaptive optimization algorithms (such as Adam) as it does for SGD. Specifically, the solution found by Adam with WD often generalizes unsatisfactorily. Although efforts have been made to mitigate this issue, the reason for this deficiency remains unclear. In this article, we first show that when using the Adam optimizer, the weight norm increases very quickly during training, whereas with SGD the weight norm increases more slowly and tends to converge. The fast increase of the weight norm counteracts WD; as a consequence, the Adam optimizer loses its efficacy in finding solutions that generalize well. To resolve this problem, we propose to tailor Adam by introducing a regularization term on the adaptive learning rate so that it becomes friendly to WD. Meanwhile, we introduce a first moment on the WD term to further enhance the regularization effect. We show that the proposed method is able to find solutions with small norm that generalize better than those found by SGD. We evaluate the proposed method on general image classification and fine-grained image classification tasks with different networks. Experimental results in all these cases substantiate the effectiveness of the proposed method in improving generalization. Specifically, the proposed method improves the test accuracy of Adam by a large margin and even improves the performance of SGD by 0.84% on CIFAR-10 and 1.03% on CIFAR-100 with ResNet-50. The code for this article is publicly available at xxx.
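
The abstract does not give the exact update rule, so the following is only a minimal Python (NumPy) sketch of the general idea it describes: an Adam-style step whose adaptive preconditioner is regularized toward a constant (so it cannot cancel the effect of WD), combined with an exponential moving average (first moment) on the decoupled weight-decay term. The function name, the interpolation form controlled by `lam`, and the hyperparameter `beta_wd` are assumptions for illustration, not the authors' algorithm.

```python
import numpy as np

def tailored_adam_step(w, grad, state, lr=1e-3, betas=(0.9, 0.999),
                       eps=1e-8, wd=1e-2, lam=0.1, beta_wd=0.9):
    """One illustrative update step (hypothetical sketch, not the paper's method).

    - `lam` (assumed) interpolates the per-coordinate adaptive step size
      toward 1, a crude stand-in for "regularizing the adaptive learning
      rate" so it stays compatible with weight decay.
    - `state['d']` keeps an exponential moving average (first moment) of
      the weight-decay direction, decoupled from the gradient moments.
    """
    b1, b2 = betas
    state['t'] += 1
    t = state['t']

    # Standard Adam first and second moments on the gradient.
    state['m'] = b1 * state['m'] + (1 - b1) * grad
    state['v'] = b2 * state['v'] + (1 - b2) * grad ** 2
    m_hat = state['m'] / (1 - b1 ** t)
    v_hat = state['v'] / (1 - b2 ** t)

    # Assumed regularization of the adaptive preconditioner: blend it with
    # a constant so the effective learning rate cannot shrink arbitrarily.
    precond = 1.0 / (np.sqrt(v_hat) + eps)
    precond = (1 - lam) * precond + lam

    # First moment (EMA) on the decoupled weight-decay term.
    state['d'] = beta_wd * state['d'] + (1 - beta_wd) * w

    return w - lr * (m_hat * precond) - lr * wd * state['d']

# Usage on a toy quadratic: minimize ||w||^2 / 2, whose gradient is w itself.
w = np.ones(4)
state = {'m': np.zeros_like(w), 'v': np.zeros_like(w),
         'd': np.zeros_like(w), 't': 0}
for _ in range(200):
    w = tailored_adam_step(w, grad=w.copy(), state=state)
print(w)  # moves toward the zero vector
```

The toy loop only checks that the combined preconditioned step and averaged decay term drive the weights toward small norm; any resemblance to the paper's reported behavior on CIFAR-10/100 is not implied.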