Tolerant Self-Distillation for image classification

Neural Netw. 2024 Jun:174:106215. doi: 10.1016/j.neunet.2024.106215. Epub 2024 Feb 28.

Abstract

Deep neural networks tend to suffer from the overfitting issue when the training data are not enough. In this paper, we introduce two metrics from the intra-class distribution of correct-predicted and incorrect-predicted samples to provide a new perspective on the overfitting issue. Based on it, we propose a knowledge distillation approach without pretraining a teacher model in advance named Tolerant Self-Distillation (TSD) for alleviating the overfitting issue. It introduces an online updating memory and selectively stores the class predictions of the samples from the past iterations, making it possible to distill knowledge across the iterations. Specifically, the class predictions stored in the memory bank serve as the soft labels for supervising the samples from the same class for the current iteration in a reverse way, i.e. the correct-predicted samples are supervised with the incorrect predictions while the incorrect-predicted samples are supervised with the correct predictions. Consequently, the premature convergence issue caused by the over-confident samples would be mitigated, which helps the model to converge to a better local optimum. Extensive experimental results on several image classification benchmarks, including small-scale, large-scale, and fine-grained datasets, demonstrate the superiority of the proposed TSD.

Keywords: Deep Learning; Overfitting; Self-Distillation; Tolerant.

MeSH terms

  • Benchmarking*
  • Knowledge*
  • Neural Networks, Computer