GLaLT: Global-Local Attention-Augmented Light Transformer for Scene Text Recognition

IEEE Transactions on Neural Networks and Learning Systems, 6 Mar 2023. doi: 10.1109/TNNLS.2023.3239696. Online ahead of print.

Abstract

Recent years have witnessed the growing popularity of connectionist temporal classification (CTC) and the attention mechanism in scene text recognition (STR). CTC-based methods are fast and impose little computational burden, but they are less accurate than attention-based methods. To retain computational efficiency while improving effectiveness, we propose the global-local attention-augmented light Transformer (GLaLT), which adopts a Transformer-based encoder-decoder structure to combine CTC with the attention mechanism. The encoder augments attention by integrating a self-attention module with a convolution module: the self-attention module captures long-range global dependencies, while the convolution module models local context. The decoder consists of two parallel modules, a Transformer-decoder-based attention module and a CTC module. During training, the attention module guides the CTC module to extract robust features; at test time, the attention module is removed, leaving a lightweight CTC-only decoder. Extensive experiments on standard benchmarks demonstrate that GLaLT achieves state-of-the-art performance on both regular and irregular STR. In terms of tradeoffs, GLaLT lies at or near the frontier that jointly maximizes speed, accuracy, and computational efficiency.
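
The abstract names two structural ideas: a global-local encoder block (self-attention for long-range dependencies plus convolution for local context) and a dual-branch decoder whose attention branch supervises training but is dropped at inference. The sketch below illustrates these ideas in PyTorch; all module names (GlobalLocalEncoderBlock, DualBranchDecoder), layer sizes, and the depthwise-convolution choice are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the architecture described in the abstract.
# Module names and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn


class GlobalLocalEncoderBlock(nn.Module):
    """Self-attention for long-range (global) dependencies plus a
    depthwise convolution for local context modeling."""

    def __init__(self, d_model=256, nhead=8, kernel_size=3):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.local_conv = nn.Conv1d(
            d_model, d_model, kernel_size,
            padding=kernel_size // 2,
            groups=d_model,  # depthwise: mixes only a local window per channel
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # global dependencies
        x = self.norm1(x + attn_out)
        conv_out = self.local_conv(x.transpose(1, 2)).transpose(1, 2)  # local context
        return self.norm2(x + conv_out)


class DualBranchDecoder(nn.Module):
    """Parallel CTC head and Transformer-decoder attention head. The
    attention branch guides training; inference keeps only the cheap
    CTC branch, which is what makes the model light."""

    def __init__(self, d_model=256, num_classes=97):
        super().__init__()
        self.ctc_head = nn.Linear(d_model, num_classes)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.attn_decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.attn_head = nn.Linear(d_model, num_classes)

    def forward(self, memory, tgt_embed=None):
        ctc_logits = self.ctc_head(memory)     # (batch, seq_len, num_classes)
        if self.training and tgt_embed is not None:
            attn_logits = self.attn_head(self.attn_decoder(tgt_embed, memory))
            return ctc_logits, attn_logits     # both losses supervise the encoder
        return ctc_logits                      # test time: CTC branch only
```

Under this reading, training would combine a CTC loss on the CTC logits with a cross-entropy loss on the attention logits so that both branches shape the shared encoder, while inference runs only the CTC path.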