MadureseSet: Madurese-Indonesian Dataset

Noor Ifada; Fika Hastarita Rachman; M Wildan Mubarok Asy Syauqy; Sri Wahyuni; Adrian Pawitra

doi:10.1016/j.dib.2023.109035

Data Brief. 2023 Mar 7;48:109035. doi: 10.1016/j.dib.2023.109035. eCollection 2023 Jun.

Authors

Noor Ifada¹, Fika Hastarita Rachman¹, M Wildan Mubarok Asy Syauqy¹, Sri Wahyuni², Adrian Pawitra³

Affiliations

¹ Informatics Department, Engineering Faculty, University of Trunojoyo Madura, Bangkalan 69162, Indonesia.
² Mechatronics Department, Engineering Faculty, University of Trunojoyo Madura, Bangkalan 69162, Indonesia.
³ Yayasan Pragalba, Bangkalan, 69162, Indonesia.

Abstract

MadureseSet is a digitized version of the printed dictionary Kamus Lengkap Bahasa Madura-Indonesia (The Complete Dictionary of Madurese-Indonesian). It stores Madurese lemmata, comprising 17809 basic lemmata and 53722 substitution lemmata, together with their Indonesian translations. The details of each lemma may include its pronunciation, part of speech, synonym and homonym relations, speech level, dialect, and loanword information. The dataset was created in three stages. First, the data extraction stage processes scans of the printed dictionary to produce corrected data in a text file. Second, the data structural review stage examines the text file's paragraph, homonym, synonym, linguistic, poem, short poem, proverb, and metaphor structures to derive the data structure that best represents the information in the dictionary. Finally, the database construction stage builds the physical data model and populates the MadureseSet database. MadureseSet was validated by a Madurese language expert who is also the author of the source dictionary. The dataset can therefore serve as a primary source for Natural Language Processing (NLP) research, especially on the Madurese language.

Keywords: Database; Dictionary; Indonesia; Madura; NLP.
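
The lemma details and the database construction stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' schema: the table names, column names, the `Lemma` class, and the `build_database` helper are all hypothetical, the loanword detail is assumed to be a simple yes/no flag, and the sample entries are placeholders rather than real dictionary content.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Lemma:
    """One dictionary entry. Every field name here is an assumption."""
    text: str                              # lemma in Madurese
    translation: str                       # translation in Indonesian
    kind: str = "basic"                    # "basic" or "substitution"
    pronunciation: Optional[str] = None
    part_of_speech: Optional[str] = None
    speech_level: Optional[str] = None
    dialect: Optional[str] = None
    is_loanword: bool = False              # assumed to be a plain flag
    homonym_index: Optional[int] = None    # separates entries sharing a spelling
    synonyms: List[str] = field(default_factory=list)


def build_database(lemmata):
    """Create an in-memory SQLite database and populate it with lemmata."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE lemma (
               id INTEGER PRIMARY KEY,
               text TEXT NOT NULL,
               translation TEXT NOT NULL,
               kind TEXT NOT NULL CHECK (kind IN ('basic', 'substitution')),
               pronunciation TEXT,
               part_of_speech TEXT,
               speech_level TEXT,
               dialect TEXT,
               is_loanword INTEGER NOT NULL,
               homonym_index INTEGER
           )"""
    )
    conn.execute(
        """CREATE TABLE synonym (
               lemma_id INTEGER NOT NULL REFERENCES lemma(id),
               synonym TEXT NOT NULL
           )"""
    )
    for lemma in lemmata:
        cur = conn.execute(
            "INSERT INTO lemma (text, translation, kind, pronunciation,"
            " part_of_speech, speech_level, dialect, is_loanword, homonym_index)"
            " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
            (lemma.text, lemma.translation, lemma.kind, lemma.pronunciation,
             lemma.part_of_speech, lemma.speech_level, lemma.dialect,
             int(lemma.is_loanword), lemma.homonym_index),
        )
        # Synonym relations go in a separate table, one row per relation.
        conn.executemany(
            "INSERT INTO synonym (lemma_id, synonym) VALUES (?, ?)",
            [(cur.lastrowid, s) for s in lemma.synonyms],
        )
    conn.commit()
    return conn


if __name__ == "__main__":
    # Placeholder entries only; these are not real Madurese lemmata.
    sample = [
        Lemma("lemma_a", "terjemahan_a", synonyms=["lemma_b"]),
        Lemma("lemma_a", "terjemahan_lain", homonym_index=2),
        Lemma("lemma_c", "terjemahan_c", kind="substitution"),
    ]
    db = build_database(sample)
    for kind, count in db.execute(
        "SELECT kind, COUNT(*) FROM lemma GROUP BY kind ORDER BY kind"
    ):
        print(kind, count)
```

Keeping synonyms in their own table is one possible design choice; it lets a lemma carry any number of synonym relations without repeating the other lemma details.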