LAP: Latency-aware automated pruning with dynamic-based filter selection

Neural Netw. 2022 Aug;152:407-418. doi: 10.1016/j.neunet.2022.05.002. Epub 2022 May 10.

Abstract

Model pruning is widely used to compress and accelerate convolutional neural networks (CNNs). Conventional pruning techniques focus only on removing as many parameters as possible while preserving model accuracy. This work optimizes not only model accuracy but also model latency during pruning. Because pursuing multiple optimization objectives at once makes algorithm design substantially harder, this paper proposes latency sensitivity to effectively guide the determination of layer sparsity. We present the latency-aware automated pruning (LAP) framework, which leverages reinforcement learning to automatically determine layer sparsity. Latency sensitivity serves as prior knowledge and is incorporated into the exploration loop. Rather than relying on a single reward signal such as validation accuracy or floating-point operations (FLOPs), our agent receives feedback on both the accuracy error and the latency sensitivity. We also provide a novel filter selection algorithm that accurately distinguishes important filters within a layer based on their dynamic changes. Compared to state-of-the-art compression policies, our framework demonstrates superior performance for VGGNet, ResNet, and MobileNet on CIFAR-10, ImageNet, and Food-101. LAP accelerates MobileNet-V1 inference by approximately 1.64× on a Titan RTX GPU with no loss of ImageNet Top-1 accuracy, and it significantly improves the Pareto-optimal curve of the accuracy-latency trade-off.
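The abstract describes the method only at a high level. As a minimal, hypothetical sketch of the two ideas (not the paper's actual implementation), the Python snippet below shows (1) a reward that penalizes both accuracy error and latency sensitivity rather than a single signal, and (2) ranking filters by the dynamic change of their weights between two training checkpoints. The function names, the weighting factor beta, and the choice of change metric are all illustrative assumptions.

```python
import numpy as np

# Hypothetical reward for the pruning agent: the paper states the agent
# receives feedback on accuracy error and latency sensitivity; this
# additive form and the weight `beta` are assumptions, not the paper's
# actual reward.
def pruning_reward(acc_error, latency_sensitivity, beta=1.0):
    return -(acc_error + beta * latency_sensitivity)

# Illustrative "dynamic-based" filter selection: score each filter by how
# much its weights moved between two checkpoints and keep the most dynamic
# ones. The paper's exact dynamics metric may differ.
def select_filters(w_prev, w_curr, keep_ratio):
    # w_prev, w_curr: (num_filters, C, kH, kW) conv weights from two epochs
    delta = np.abs(w_curr - w_prev).reshape(w_curr.shape[0], -1)
    scores = delta.sum(axis=1)            # per-filter dynamic change
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    return np.argsort(scores)[-n_keep:]   # indices of filters to keep

# Toy usage on a random 16-filter 3x3 conv layer.
rng = np.random.default_rng(0)
w0 = rng.standard_normal((16, 3, 3, 3))
w1 = w0 + 0.1 * rng.standard_normal(w0.shape)
print("reward:", pruning_reward(acc_error=0.05, latency_sensitivity=0.2))
print("kept filters:", sorted(select_filters(w0, w1, keep_ratio=0.5).tolist()))
```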

Keywords: AutoML; Channel pruning; Model compression and acceleration; Reinforcement learning.

MeSH terms

  • Algorithms
  • Automation
  • Data Compression*
  • Neural Networks, Computer*