Dataset for comparable evaluation of machine translation between 11 South African languages

Data Brief. 2020 Jan 14;29:105146. doi: 10.1016/j.dib.2020.105146. eCollection 2020 Apr.

Abstract

This data article describes the Autshumato machine translation evaluation set. The evaluation set contains data that can be used to evaluate machine translation systems between any of the 11 official South African languages. The dataset is parallel, with four reference translations available for each of the following languages: Afrikaans, English, isiNdebele, isiXhosa, isiZulu, Sepedi, Sesotho, Setswana, Siswati, Tshivenḓa and Xitsonga.
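
As a rough illustration of how several reference translations can be used together in automatic evaluation, the sketch below computes a clipped n-gram precision against multiple references, the building block of BLEU-style metrics. The function names and the example sentences are hypothetical and are not taken from the Autshumato data or its tooling; whitespace tokenisation is a simplifying assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, references, n=1):
    """Clipped n-gram precision of one hypothesis against several references.

    Each hypothesis n-gram is credited at most as many times as it occurs
    in the single reference where it is most frequent.
    """
    hyp_counts = ngrams(hypothesis.split(), n)
    if not hyp_counts:
        return 0.0

    # For every n-gram, keep the highest count seen in any one reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)

    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / sum(hyp_counts.values())


# Hypothetical example: one system output scored against four references.
refs = [
    "the cat is on the mat",
    "a cat sat on a mat",
    "the cat sat on the rug",
    "there is a cat on the mat",
]
score = clipped_precision("the cat sat on the mat", refs, n=1)
```

With more than one reference, a hypothesis is not penalised for choosing a valid wording that only some translators used, which is one reason multi-reference sets are useful for comparing systems.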

Keywords: Automatic evaluation; Human language technology; Machine translation; Natural language processing.