A7׳ta: Data on a monolingual Arabic parallel corpus for grammar checking

Nora Madi; Hend S Al-Khalifa

doi:10.1016/j.dib.2018.11.146

Data Brief. 2018 Dec 4:22:237-240. doi: 10.1016/j.dib.2018.11.146. eCollection 2019 Feb.

Authors

Nora Madi¹, Hend S Al-Khalifa¹

Affiliation

¹ Department of Information Technology, College of Computer & Information Sciences, King Saud University, Riyadh, Saudi Arabia.

Abstract

Grammar error correction can be treated as a "translation" problem in which an erroneous sentence is "translated" into a correct version of the sentence in the same language. This can be accomplished with techniques such as Statistical Machine Translation (SMT) or Neural Machine Translation (NMT). Training SMT or NMT models for grammar correction requires monolingual parallel corpora in the target language. This data article presents a monolingual parallel corpus of Arabic text called A7׳ta. It contains 470 erroneous sentences and their 470 error-free counterparts. The corpus can serve as a linguistic resource for Arabic natural language processing (NLP), mainly for training sequence-to-sequence models for grammar checking. The sentences were manually collected from a book prepared as a guide to writing correctly and to using Arabic grammar and other linguistic features. Although a number of Arabic corpora of errors and corrections are available [2], such as QALB [10] and the Arabic Learner Corpus [11], the data presented in this article is an effort to increase the number of freely available Arabic corpora of this kind by providing a detailed error specification and leveraging the work of language experts.

Keywords: Arabic language; Error checking; NLP; Parallel corpus.
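
The parallel structure described in the abstract (each erroneous sentence aligned with its error-free counterpart) can be pictured as a list of source/target pairs, which is the form sequence-to-sequence training typically expects. The following is a minimal Python sketch under stated assumptions, not the authors' tooling: the function names `make_pairs` and `split_pairs`, the 10% held-out fraction, and the placeholder strings are all hypothetical, and the actual file layout of the corpus is not specified here.

```python
import random
from typing import List, Tuple

# One training example: (erroneous sentence, corrected sentence).
Pair = Tuple[str, str]


def make_pairs(erroneous: List[str], corrected: List[str]) -> List[Pair]:
    """Align the two sides of a monolingual parallel corpus line by line."""
    if len(erroneous) != len(corrected):
        raise ValueError("both sides of a parallel corpus need the same number of sentences")
    return [(src.strip(), tgt.strip()) for src, tgt in zip(erroneous, corrected)]


def split_pairs(
    pairs: List[Pair], held_out_fraction: float = 0.1, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle reproducibly and split into (training, held-out) pairs.

    The held-out fraction is an assumption made for illustration only.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


# Placeholder strings stand in for the 470 sentence pairs; they are not corpus data.
erroneous_side = [f"erroneous sentence {i}" for i in range(470)]
corrected_side = [f"corrected sentence {i}" for i in range(470)]

pairs = make_pairs(erroneous_side, corrected_side)
train_pairs, held_out_pairs = split_pairs(pairs)
print(len(train_pairs), len(held_out_pairs))
```

Keeping each erroneous sentence and its correction together in one tuple before shuffling ensures the alignment survives the split, which matters for a corpus this small.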