DeepLontar dataset for handwritten Balinese character detection and syllable recognition on Lontar manuscript

Daniel Siahaan; Ni Putu Sutramiani; Nanik Suciati; I Nengah Duija; I Wayan Agus Surya Darma

doi:10.1038/s41597-022-01867-5

Sci Data. 2022 Dec 10;9(1):761. doi: 10.1038/s41597-022-01867-5.

Authors

Daniel Siahaan¹, Ni Putu Sutramiani¹ ², Nanik Suciati³, I Nengah Duija⁴, I Wayan Agus Surya Darma¹ ⁵

Affiliations

¹ Department of Informatics, Faculty of Intelligent Electrical and Informatics Technology, Institut Teknologi Sepuluh Nopember, Surabaya, 60111, Indonesia.
² Department of Information Technology, Faculty of Engineering, Universitas Udayana, Badung, 80361, Indonesia.
³ Department of Informatics, Faculty of Intelligent Electrical and Informatics Technology, Institut Teknologi Sepuluh Nopember, Surabaya, 60111, Indonesia. nanik@if.its.ac.id.
⁴ Department of Balinese Language Education, Postgraduate, Universitas Hindu Negeri I Gusti Bagus Sugriwa, Denpasar, 80236, Indonesia.
⁵ Department of Informatics, Faculty of Technology and Informatics, Institut Bisnis dan Teknologi Indonesia, Denpasar, 80225, Indonesia.

Abstract

The digitization of traditional Palmyra manuscripts, such as Lontar, is the government's main focus in its efforts to preserve Balinese culture. Digitization is done by acquiring Lontar manuscripts as photos or scans. To understand a Lontar's contents, experts usually transliterate it. Automatic transliteration using computer vision is generally carried out in several stages: character detection, character recognition, syllable recognition, and word recognition. Many methods can be used for detection and recognition, but they all need data to train and evaluate the resulting model, and that data must be processed and labelled when the dataset is compiled. This paper presents the data collection and dataset construction for the detection and recognition tasks. The Lontar manuscripts were collected from libraries at universities in Bali. To increase the variety of the data, 400 augmented images were generated from 200 original Lontar images. Each character was annotated with a label, yielding over 100,000 labelled characters in 55 character classes. This dataset can be used to train models for character detection and syllable recognition on new manuscripts and to evaluate their performance.

MeSH terms

Handwriting*
Indonesia
Pattern Recognition, Automated* / methods