Layerwised multimodal knowledge distillation for vision-language pretrained model

Neural Netw. 2024 Jul:175:106272. doi: 10.1016/j.neunet.2024.106272. Epub 2024 Mar 26.

Abstract

Transformer-based models can learn representations for images and text simultaneously, providing excellent performance for multimodal applications. In practice, however, their large number of parameters may hinder deployment on resource-constrained devices, creating a need for model compression. To this end, recent studies use knowledge distillation to transfer knowledge from a large trained teacher model to a small student model without sacrificing performance. However, such approaches train the student's parameters using only the last layer of the teacher, which makes the student model prone to overfitting during the distillation procedure. Furthermore, mutual interference between modalities makes distillation more difficult. To address these issues, this study proposes layerwised multimodal knowledge distillation for a vision-language pretrained model. In addition to the last layer, the intermediate layers of the teacher are used for knowledge transfer. To avoid interference between modalities, the multimodal input is split into separate modalities that are added as extra inputs, and two auxiliary losses encourage each modality to distill more effectively. Comparative experiments on four multimodal tasks show that the proposed layerwised multimodal distillation outperforms other knowledge distillation methods for vision-language pretrained models.
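To illustrate the general idea of combining last-layer distillation with intermediate-layer supervision, the sketch below shows a generic layerwise distillation loss in PyTorch. This is not the paper's implementation; the function name, the teacher-to-student layer pairing, and the weighting hyperparameters (temperature, alpha) are assumptions for illustration only.

    import torch
    import torch.nn.functional as F

    def layerwise_kd_loss(student_hidden, teacher_hidden,
                          student_logits, teacher_logits,
                          temperature=2.0, alpha=0.5):
        """Combine last-layer logit distillation with intermediate-layer matching.

        student_hidden / teacher_hidden: lists of [batch, seq, dim] hidden states,
        where each student layer has already been paired with a teacher layer
        (e.g., by a uniform layer mapping). Shapes are assumed to match; a linear
        projection would be needed if the student uses a smaller hidden size.
        """
        # Soft-label distillation on the final prediction layer.
        kd = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2

        # Layerwise matching: pull each student layer toward its paired teacher layer.
        hidden = sum(F.mse_loss(s, t) for s, t in zip(student_hidden, teacher_hidden))
        hidden = hidden / len(student_hidden)

        return alpha * kd + (1 - alpha) * hidden

In a multimodal setting such as the one described above, separate text-only and image-only inputs could each receive an auxiliary loss of this form in addition to the joint multimodal loss, so that each modality is distilled with less interference from the other.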

Keywords: Multimodality knowledge distillation; Transformer; UNITER; Vision-language pretrained model.

MeSH terms

  • Humans
  • Knowledge
  • Language
  • Neural Networks, Computer*