Arabic punctuation dataset

Sane Yagi; Ashraf Elnagar; Esra Yaghi

doi:10.1016/j.dib.2024.110118

Data Brief. 2024 Feb 1;53:110118. doi: 10.1016/j.dib.2024.110118. eCollection 2024 Apr.

Authors

Sane Yagi¹, Ashraf Elnagar², Esra Yaghi³

Affiliations

¹ Department of Foreign Languages, University of Sharjah, the United Arab Emirates.
² Department of Computer Science, University of Sharjah, the United Arab Emirates.
³ Department of Linguistics, University of Waikato, Hamilton, New Zealand.

Abstract

Arabic, unlike many languages, suffers from punctuation inconsistency, which poses a significant obstacle for Natural Language Processing (NLP). To address this, we present the Arabic Punctuation Dataset (APD), a large collection of annotated Modern Standard Arabic (MSA) texts designed to train machine learning models in sentence boundary identification and punctuation prediction. APD leverages the "theme-rheme completion" principle, a grammatical feature closely linked to consistent punctuation placement. The collection encompasses 312 million words in approximately 12 million sentences and comprises three diverse components: (1) Arabic Book Chapters (ABC): manually annotated non-fiction book excerpts that constitute a gold-standard reference; (2) Complete Book Translations (CBT): parallel English-Arabic book translations with aligned sentence endings, ideal for machine translation training; and (3) Scrambled Sentences from the Arabic Component of the United Nations Parallel Corpus (SSAC-UNPC): jumbled sentences for training models in automatic punctuation restoration. Beyond NLP, APD serves as a valuable resource for linguistics research, language learning, and real-time subtitling. Its authentic, grammar-based approach can enhance the readability and clarity of machine-generated text, opening doors for applications such as automatic speech recognition, text summarization, and machine translation.

Keywords: Automatic punctuation; Punctuation corpus; Sentence boundary identification; Theme-rheme; Topic and comment.
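
The punctuation prediction task mentioned in the abstract is commonly framed as token labeling: each word is tagged with the mark that follows it, or with a "no punctuation" tag. The sketch below illustrates that framing only. The function name `to_token_labels`, the label set in `PUNCT`, the "O" tag, and the example sentence are assumptions made for illustration; they are not taken from the dataset or from the authors' preprocessing.

```python
import re

# Marks treated as prediction targets. This set is an illustrative assumption,
# not the dataset's documented label inventory: Arabic comma, Arabic semicolon,
# Arabic question mark, full stop, exclamation mark, colon.
PUNCT = "،؛؟.!:"

# A token is a run of characters that are neither whitespace nor punctuation,
# optionally followed by a single punctuation mark.
_TOKEN = re.compile(
    rf"([^\s{re.escape(PUNCT)}]+)\s*([{re.escape(PUNCT)}])?"
)


def to_token_labels(sentence):
    """Turn a punctuated sentence into (word, label) pairs.

    The label is the mark that directly follows the word, or "O" when
    no mark follows. Only the first mark after a word is kept.
    """
    return [(word, mark or "O") for word, mark in _TOKEN.findall(sentence)]


if __name__ == "__main__":
    # Hypothetical example sentence, not drawn from the dataset.
    for word, label in to_token_labels("ذهب الولد إلى المدرسة، ثم عاد."):
        print(word, label)
```

Dropping the labels from the pairs yields the unpunctuated word sequence that a restoration model would receive as input, while the labels serve as its training targets.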