Lightweight Scene Text Recognition Based on Transformer

Xin Luan; Jinwei Zhang; Miaomiao Xu; Wushouer Silamu; Yanbing Li

doi:10.3390/s23094490

Sensors (Basel). 2023 May 5;23(9):4490. doi: 10.3390/s23094490.

Authors

Xin Luan¹ ² ³, Jinwei Zhang¹ ² ³, Miaomiao Xu¹ ² ³, Wushouer Silamu¹ ² ³, Yanbing Li¹ ² ³

Affiliations

¹ College of Information Science and Engineering, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China.
² Xinjiang Laboratory of Multi-Language Information Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China.
³ Xinjiang Multilingual Information Technology Research Center, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China.

Abstract

Scene text recognition (STR) is an active research area in computer vision that aims to recognize text in natural scenes. Current attention-based encoder-decoder frameworks struggle to align feature regions precisely with the target characters in complex and low-quality images, a phenomenon known as attention drift. In addition, with the rise of the Transformer, growing parameter counts lead to higher computational costs. To address these problems, and building on recent results for the Vision Transformer (ViT), we add a position-enhancement branch to alleviate attention drift and dynamically fuse position information with visual information to achieve better recognition accuracy. Experimental results show that our model achieves a 3% higher average recognition accuracy on the test set than the baseline. At the same time, our model keeps a small number of parameters and a fast inference speed, achieving a good balance between accuracy, speed, and computational load.

Keywords: attention mechanism; scene text recognition; transformer.
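
The abstract does not say how position information and visual information are fused. As an illustration only, one common way to fuse two feature vectors dynamically is a learned per-element gate. The sketch below assumes that form; the function name gated_fusion, the weight and bias parameters, and the gating formula are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, position, w_visual, w_position, bias):
    """Fuse two equal-length feature vectors with a per-element gate.

    Assumed (hypothetical) form, not the paper's stated method:
        gate_i  = sigmoid(w_visual_i * visual_i + w_position_i * position_i + bias_i)
        fused_i = gate_i * visual_i + (1 - gate_i) * position_i
    """
    lengths = {len(visual), len(position), len(w_visual), len(w_position), len(bias)}
    if len(lengths) != 1:
        raise ValueError("all inputs must have the same length")

    fused = []
    for v, p, wv, wp, b in zip(visual, position, w_visual, w_position, bias):
        # The gate depends on both inputs, so the mix changes per sample.
        gate = sigmoid(wv * v + wp * p + b)
        fused.append(gate * v + (1.0 - gate) * p)
    return fused


if __name__ == "__main__":
    # With zero weights and bias the gate is 0.5, so the result is the mean.
    print(gated_fusion([1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]))
```

In a trained model the weights and bias would be learned, so the gate could lean on position features where visual evidence is weak (for example, in blurred regions) and on visual features elsewhere.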
