DeepLontar dataset for handwritten Balinese character detection and syllable recognition on Lontar manuscript

Sci Data. 2022 Dec 10;9(1):761. doi: 10.1038/s41597-022-01867-5.

Abstract

The digitization of traditional Palmyra manuscripts, such as Lontar, is the government's main focus in its efforts to preserve Balinese culture. Manuscripts are digitized by photographing or scanning them. To understand a Lontar's contents, experts usually transliterate it. Automatic transliteration using computer vision is generally carried out in several stages: character detection, character recognition, syllable recognition, and word recognition. Many methods are available for detection and recognition, but all of them need data to train and evaluate the resulting models. Compiling such a dataset requires processing and labelling the data. This paper presents the data collection and dataset construction for the detection and recognition tasks. Lontar manuscripts were collected from university libraries in Bali. Data generation produced 400 augmented images from 200 original Lontar images to increase the variety of the data. Each character was then annotated with a label, yielding over 100,000 characters in 55 character classes. The dataset can be used to train models for character detection and syllable recognition and to evaluate their performance on new manuscripts.
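As a rough illustration of how per-character annotations like those described above might be checked, the sketch below tallies bounding boxes per class. The abstract does not state the annotation file format; this example assumes YOLO-style text lines of the form `class_id x_center y_center width height`, and the function name `count_classes` and the sample lines are hypothetical, not taken from the dataset.

```python
from collections import Counter

NUM_CLASSES = 55  # number of character classes stated in the abstract


def count_classes(label_lines):
    """Tally annotated boxes per class id.

    Assumes YOLO-style lines: "class_id x_center y_center width height".
    This layout is an assumption, not confirmed by the abstract.
    """
    counts = Counter()
    for line in label_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 5:
            raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
        class_id = int(fields[0])
        if not 0 <= class_id < NUM_CLASSES:
            raise ValueError(f"class id {class_id} outside 0..{NUM_CLASSES - 1}")
        counts[class_id] += 1
    return counts


# Hypothetical label lines for one image (made up for illustration).
sample = [
    "0 0.512 0.304 0.041 0.120",
    "17 0.566 0.310 0.038 0.115",
    "0 0.620 0.298 0.040 0.118",
    "",
]
counts = count_classes(sample)
print(counts[0], counts[17], sum(counts.values()))
```

Running the same tally over every label file would show how the characters are distributed across the 55 classes, which is useful for spotting rare classes before training a detector.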

MeSH terms

  • Handwriting*
  • Indonesia
  • Pattern Recognition, Automated* / methods