Zero time waste in pre-trained early exit neural networks

Bartosz Wójcik; Marcin Przewięźlikowski; Filip Szatkowski; Maciej Wołczyk; Klaudia Bałazy; Bartłomiej Krzepkowski; Igor Podolak; Jacek Tabor; Marek Śmieja; Tomasz Trzciński

doi:10.1016/j.neunet.2023.10.003

Neural Netw. 2023 Nov:168:580-601. doi: 10.1016/j.neunet.2023.10.003. Epub 2023 Oct 9.

Affiliations

¹ Faculty of Mathematics and Computer Science, Jagiellonian University, Poland; Doctoral School of Exact and Natural Sciences, Jagiellonian University, Poland; IDEAS NCBR, Poland. Electronic address: b.wojcik@doctoral.uj.edu.pl.
² Faculty of Mathematics and Computer Science, Jagiellonian University, Poland; Doctoral School of Exact and Natural Sciences, Jagiellonian University, Poland; IDEAS NCBR, Poland.
³ Warsaw University of Technology, Poland; IDEAS NCBR, Poland.
⁴ Faculty of Mathematics and Computer Science, Jagiellonian University, Poland; Doctoral School of Exact and Natural Sciences, Jagiellonian University, Poland.
⁵ University of Warsaw, Poland; IDEAS NCBR, Poland.
⁶ Faculty of Mathematics and Computer Science, Jagiellonian University, Poland.
⁷ Faculty of Mathematics and Computer Science, Jagiellonian University, Poland; Warsaw University of Technology, Poland; IDEAS NCBR, Poland; Tooploox, Poland.

PMID: 37837747
DOI: 10.1016/j.neunet.2023.10.003

Abstract

Reducing the processing time of large deep learning models is a fundamental challenge in many real-world applications. Early exit methods strive towards this goal by attaching additional Internal Classifiers (ICs) to intermediate layers of a neural network. ICs can quickly return predictions for easy examples and, as a result, reduce the average inference time of the whole model. However, if a particular IC does not decide to return an answer early, its predictions are discarded and its computations are effectively wasted. To solve this issue, we introduce Zero Time Waste (ZTW), a novel approach in which each IC reuses predictions returned by its predecessors by (1) adding direct connections between ICs and (2) combining previous outputs in an ensemble-like manner. We conduct extensive experiments across multiple modes, datasets, and architectures to demonstrate that ZTW achieves a significantly better accuracy vs. inference time trade-off than other early exit methods. On the ImageNet dataset, it obtains superior results over the best baseline method in 11 out of 16 cases, reaching up to 5 percentage points of improvement on low computational budgets.
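
The two ideas named in the abstract, direct connections between ICs and ensemble-like combination of earlier outputs, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the combination rule or the exit criterion, so the unweighted geometric mean, the confidence threshold, and all names below (`early_exit_predict`, `toy_stage`, `toy_head`) are assumptions made purely for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def geometric_ensemble(prob_history):
    """Combine IC outputs with an unweighted geometric mean (assumed rule)."""
    n = len(prob_history)
    num_classes = len(prob_history[0])
    log_mean = [
        sum(math.log(p[c] + 1e-12) for p in prob_history) / n
        for c in range(num_classes)
    ]
    return softmax(log_mean)  # renormalise so the result sums to 1


def early_exit_predict(x, stages, heads, threshold=0.9):
    """Run backbone stages in order and exit at the first confident IC.

    stages: callables mapping hidden features to hidden features.
    heads:  callables (features, previous_ic_probs_or_None) -> logits, so
            each IC can reuse its predecessor's output (direct connection).
    Returns (predicted_class, exit_index, combined_probs).
    """
    h, prev_probs, history = x, None, []
    last = len(stages) - 1
    for i, (stage, head) in enumerate(zip(stages, heads)):
        h = stage(h)
        probs = softmax(head(h, prev_probs))
        history.append(probs)
        combined = geometric_ensemble(history)  # reuse all earlier ICs
        if max(combined) >= threshold or i == last:
            return combined.index(max(combined)), i, combined
        prev_probs = probs


# Hypothetical toy backbone: each stage doubles the features, and each
# head adds the log-probabilities of the previous IC to its own logits.
def toy_stage(h):
    return [2.0 * v for v in h]


def toy_head(h, prev_probs):
    if prev_probs is None:
        return list(h)
    return [v + math.log(p) for v, p in zip(h, prev_probs)]


label, exit_index, probs = early_exit_predict(
    [1.0, 0.0], [toy_stage] * 3, [toy_head] * 3, threshold=0.9
)
```

In this toy run the first IC falls short of the threshold, but its output is not discarded: it feeds the second IC directly and also enters the ensemble that makes the exit decision. In a trained model the heads and any combination weights would be learned rather than fixed as here.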

Keywords: Conditional computation; Deep learning; Dynamic neural networks; Early-exiting networks; Zero waste models.

MeSH terms

Databases, Factual
Motivation*
Neural Networks, Computer*