Quantformer: Learning Extremely Low-Precision Vision Transformers

IEEE Trans Pattern Anal Mach Intell. 2023 Jul;45(7):8813-8826. doi: 10.1109/TPAMI.2022.3229313. Epub 2023 Jun 5.

Abstract

In this article, we propose extremely low-precision vision transformers, called Quantformer, for efficient inference. Conventional network quantization methods directly quantize the weights and activations of fully-connected layers without considering the properties of transformer architectures. Quantization causes the self-attention to deviate sizably from its full-precision counterpart, and a quantization strategy shared across diversely distributed patch features causes severe quantization errors. To address these issues, we enforce the self-attention rank of quantized transformers to mimic that of full-precision counterparts with a capacity-aware distribution for information retention, and quantize patch features with a group-wise discretization strategy to minimize quantization error. Specifically, we efficiently preserve self-attention rank consistency by minimizing the distance between the self-attention of quantized and real-valued transformers with an adaptive concentration degree, where the optimal concentration degree is selected according to the self-attention entropy for model-capacity adaptation. Moreover, we partition patch features along different dimensions with differentiable group assignment, so that features in different groups leverage different discretization strategies with minimal rounding and clipping errors. Experimental results show that our Quantformer outperforms state-of-the-art network quantization methods by a sizable margin across various vision transformer architectures in image classification and object detection. We also integrate our Quantformer with mixed-precision quantization to further enhance the performance of the vanilla models.
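
The Python (PyTorch) sketch below is a minimal illustration of the two ideas summarized in the abstract, not the paper's implementation: the function names (groupwise_quantize, select_concentration, attention_consistency_loss), the fixed contiguous channel grouping, the candidate temperatures, and the entropy-based selection rule are all illustrative assumptions; the paper instead learns the group assignment differentiably and uses a capacity-aware criterion for the concentration degree.

import math

import torch
import torch.nn.functional as F


def groupwise_quantize(x, num_groups=4, num_bits=4):
    """Quantize patch features with a separate clipping range per channel group.

    x: (batch, tokens, dim) patch features. Contiguous channel groups are
    used here purely for illustration; each group gets its own step size
    to reduce rounding and clipping error.
    """
    qmax = 2 ** (num_bits - 1) - 1
    outs = []
    for chunk in x.chunk(num_groups, dim=-1):
        scale = chunk.abs().max().clamp(min=1e-8) / qmax   # per-group step size
        q = torch.clamp(torch.round(chunk / scale), -qmax - 1, qmax)
        outs.append(q * scale)                             # dequantize for this float sketch
    return torch.cat(outs, dim=-1)


def select_concentration(fp_attn_logits, candidates=(0.5, 1.0, 2.0)):
    """Toy concentration selection from attention entropy (illustrative only).

    Low-entropy (sharply peaked) attention picks a smaller temperature,
    high-entropy attention a larger one.
    """
    p = F.softmax(fp_attn_logits, dim=-1)
    entropy = -(p * p.clamp(min=1e-8).log()).sum(dim=-1).mean()
    norm = entropy / math.log(fp_attn_logits.shape[-1])    # normalized to [0, 1]
    idx = min(int(norm * len(candidates)), len(candidates) - 1)
    return candidates[idx]


def attention_consistency_loss(q_attn_logits, fp_attn_logits):
    """Distance between quantized and full-precision self-attention maps."""
    tau = select_concentration(fp_attn_logits)
    target = F.softmax(fp_attn_logits / tau, dim=-1)
    log_pred = F.log_softmax(q_attn_logits / tau, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")


# Example usage on ViT-Base-sized tensors.
x = torch.randn(2, 197, 768)                      # patch tokens
xq = groupwise_quantize(x, num_groups=6, num_bits=4)
fp_logits = torch.randn(2, 12, 197, 197)          # full-precision attention logits
q_logits = fp_logits + 0.1 * torch.randn_like(fp_logits)
loss = attention_consistency_loss(q_logits, fp_logits)

In this sketch the concentration degree acts as a softmax temperature applied to both attention maps before the KL distance is taken; in training, such a loss would be added to the task loss so the quantized attention tracks its real-valued counterpart.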