ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition

J Imaging. 2023 Dec 13;9(12):276. doi: 10.3390/jimaging9120276.

Abstract

Attention-based encoder-decoder scene text recognition (STR) architectures have proven effective at recognizing text in the real world, thanks to their ability to learn an internal language model. Nevertheless, the cross-attention operation used to align visual and linguistic features during decoding is computationally expensive, especially in low-resource environments. To address this bottleneck, we propose a cross-attention-free STR framework that still learns a language model. The proposed framework, ViTSTR-Transducer, draws inspiration from ViTSTR, a vision transformer (ViT)-based method designed for STR, and from the recurrent neural network transducer (RNN-T) initially introduced for speech recognition. The experimental results show that our ViTSTR-Transducer models outperform the baseline attention-based models in terms of the required decoding floating-point operations (FLOPs) and latency while achieving a comparable level of recognition accuracy. Compared with the baseline context-free ViTSTR models, our proposed models achieve superior recognition accuracy. Furthermore, compared with recent state-of-the-art (SOTA) methods, our proposed models deliver competitive results.

Keywords: RNN-T; autoregressive language model; cross-attention; scene text recognition (STR); vision transformer (ViT).
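
For readers unfamiliar with the transducer formulation referenced in the abstract, the sketch below illustrates the general RNN-T idea of fusing encoder features with prediction-network (internal language model) states through a simple joiner rather than cross-attention. It is a minimal PyTorch-style illustration under assumed layer sizes and module names, not the authors' exact ViTSTR-Transducer architecture.

```python
# Illustrative transducer-style fusion of visual and linguistic features.
# Layer sizes, names, and the LSTM prediction network are assumptions for
# this sketch only; they do not reproduce the paper's configuration.
import torch
import torch.nn as nn


class PredictionNetwork(nn.Module):
    """Small autoregressive language model over previously emitted characters."""

    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, prev_tokens: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(self.embed(prev_tokens))
        return out  # (B, U, D)


class TransducerJoiner(nn.Module):
    """Combine encoder and prediction-network states without cross-attention."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, visual: torch.Tensor, linguistic: torch.Tensor) -> torch.Tensor:
        # visual:     (B, T, D) patch features from a ViT-style encoder
        # linguistic: (B, U, D) states from the prediction network
        # A broadcast addition replaces the query/key/value projections and
        # softmax alignment that cross-attention would perform.
        joint = visual.unsqueeze(2) + linguistic.unsqueeze(1)  # (B, T, U, D)
        return self.proj(torch.tanh(joint))                    # (B, T, U, V)


# Example shapes (illustrative): 25 visual tokens, 10 label positions, 97-symbol vocabulary.
enc = torch.randn(2, 25, 256)
pred = PredictionNetwork(vocab_size=97, d_model=256)(torch.randint(0, 97, (2, 10)))
logits = TransducerJoiner(256, 97)(enc, pred)  # -> (2, 25, 10, 97)
```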