Multistructure-Based Collaborative Online Distillation

Liang Gao; Xu Lan; Haibo Mi; Dawei Feng; Kele Xu; Yuxing Peng

doi:10.3390/e21040357

Entropy (Basel). 2019 Apr 2;21(4):357. doi: 10.3390/e21040357.

Authors

Liang Gao¹, Xu Lan², Haibo Mi¹, Dawei Feng¹, Kele Xu¹, Yuxing Peng¹

Affiliations

¹ National Key Laboratory of Parallel and Distributed Processing, College of Computer, National University of Defense Technology, Changsha 410073, China.
² School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK.

Abstract

Deep learning has recently achieved state-of-the-art performance in many areas, outperforming traditional machine-learning methods based on shallow architectures. However, achieving higher accuracy usually requires increasing network depth or ensembling the outputs of several neural networks, both of which raise the demand for memory and computing resources. This makes it difficult to deploy deep-learning models in resource-constrained scenarios such as drones, mobile phones, and autonomous driving. Improving network performance without enlarging the network has therefore become an active research topic. In this paper, we propose a cross-architecture online-distillation approach that addresses this problem by transferring supplementary information between different networks. We use an ensemble method to aggregate networks of different structures, forming stronger teachers than those used in traditional distillation methods. In addition, discontinuous distillation with progressively enhanced constraints replaces fixed distillation in order to reduce the loss of information diversity during distillation. Our training method improves the distillation effect and yields substantial gains in network performance. We validated the approach on several popular models: on the CIFAR100 dataset, accuracy improved by 5.94% for AlexNet, 2.88% for VGG, 5.07% for ResNet, and 1.28% for DenseNet. Extensive experiments on the CIFAR10, CIFAR100, and ImageNet datasets show significant improvements over traditional knowledge distillation.

Keywords: deep learning; distributed architecture; knowledge distillation; supplementary information.
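
The two ideas named in the abstract, an ensemble teacher built from networks of different structures and a distillation constraint that is applied intermittently and strengthened over time, can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything specific here is an assumption made for illustration: the teacher is taken to be the plain average of peer logits, the loss is a temperature-scaled KL divergence as in conventional knowledge distillation, and `distillation_weight` with its `interval` and `max_weight` parameters is a hypothetical schedule, not the authors' published one.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_teacher_logits(peer_logits):
    """Average the logits of several peer networks class by class.

    Assumption: a plain mean is used as the aggregation rule.
    """
    n = len(peer_logits)
    return [sum(column) / n for column in zip(*peer_logits)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, temperature=3.0):
    """Temperature-scaled KL loss between teacher and student predictions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl_divergence(p_teacher, p_student)


def distillation_weight(epoch, total_epochs, interval=2, max_weight=1.0):
    """Hypothetical schedule: distill only every `interval` epochs, and let
    the weight of the distillation term grow linearly with the epoch."""
    if epoch % interval != 0:
        return 0.0  # no distillation constraint in this epoch
    return max_weight * (epoch + 1) / total_epochs


# Logits from three peers of different architectures for one 3-class sample.
peer_logits = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5], [2.5, 0.0, -1.5]]
teacher = ensemble_teacher_logits(peer_logits)

# Each peer is pulled toward the ensemble teacher, scaled by the schedule.
epoch, total_epochs = 4, 10
weight = distillation_weight(epoch, total_epochs)
losses = [weight * distillation_loss(student, teacher) for student in peer_logits]
```

In a full training loop, each weighted term in `losses` would be added to that peer's ordinary cross-entropy loss. Setting the weight to zero in some epochs lets the peers train independently for a while, which is one plausible way to preserve the diversity of information that the abstract says fixed distillation tends to lose.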