Pashtu Language Digits Dataset

Data Brief. 2022 Oct 26;45:108701. doi: 10.1016/j.dib.2022.108701. eCollection 2022 Dec.

Abstract

Pashtu is a language spoken by 50 million people in the world [1]. It is the national language of Afghanistan and is also spoken in the two largest provinces of Pakistan. It is written in a complex way by calligraphers. Despite the enormous literature and research work on optical character recognition (OCR) for other languages of the world, Pashtu still lacks a mature OCR system [2], [3]. This paper introduces a real dataset of Pashtu digits comprising 50,000 scanned images and makes it publicly available for research purposes. All digits in the images were handwritten by faculty members, staff, and students of the Pak-Austria Fachhochschule, Institute of Applied Sciences and Technology, Pakistan. A total of 1,250 participants contributed handwriting samples, half of them male and half female.

Keywords: Machine learning (ML); Natural language processing (NLP); Optical character recognition; Pashtu Language Digits Dataset (PLDD); Text recognition.