Linguistically annotated dataset for four official South African languages with a conjunctive orthography: IsiNdebele, isiXhosa, isiZulu, and Siswati

Tanja Gaustad; Martin J Puttkammer

doi:10.1016/j.dib.2022.107994

Data Brief. 2022 Feb 25:41:107994. doi: 10.1016/j.dib.2022.107994. eCollection 2022 Apr.

Authors

Tanja Gaustad¹, Martin J Puttkammer¹

Affiliation

¹ Centre for Text Technology, North-West University, South Africa.

Abstract

This data article presents a linguistically annotated data set for four official South African languages with a conjunctive orthography, namely isiNdebele, isiXhosa, isiZulu and Siswati. The data set is parallel across all four languages and can be used for language-specific as well as cross-language development and evaluation of Natural Language Processing (NLP) core technologies. It can also be used for corpus linguistic studies. The article describes how the data was collected and what types of text it contains, and gives details, including an example, on the three types of linguistic annotation added (morphology, part-of-speech and lemmas).

Keywords: Human language technology; Linguistic annotation; Natural language processing; Nguni languages.