Morpheme matching based text tokenization for a scarce resourced language

Zobia Rehman; Waqas Anwar; Usama Ijaz Bajwa; Wang Xuan; Zhou Chaoying

doi:10.1371/journal.pone.0068178

PLoS One. 2013 Aug 21;8(8):e68178. doi: 10.1371/journal.pone.0068178. eCollection 2013.

Authors

Zobia Rehman¹, Waqas Anwar, Usama Ijaz Bajwa, Wang Xuan, Zhou Chaoying

Affiliation

¹ Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan.

Abstract

Text tokenization is a fundamental pre-processing step for almost all information processing applications. The task is nontrivial for scarce-resourced languages such as Urdu, because spaces are used inconsistently between words. This paper proposes a morpheme matching based approach to Urdu text tokenization, along with further algorithms that address boundary detection for compound words, affixation, reduplication, names, and abbreviations. The approach achieved 97.28% precision, 93.71% recall, and a 95.46% F1-measure when tokenizing a corpus of 57,000 words using a morpheme list of 6,400 entries.
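
As a quick consistency check on the reported figures, the sketch below recomputes the F1-measure from the stated precision and recall, assuming the standard definition of F1 as their harmonic mean. The function name `f1_score` is illustrative and does not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (result is in the same units as the inputs)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


precision = 97.28  # percent, as reported in the abstract
recall = 93.71     # percent, as reported in the abstract

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2f}%")  # prints "F1 = 95.46%"
```

The recomputed value agrees with the 95.46% stated in the abstract.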

MeSH terms

Algorithms
Artificial Intelligence
Information Storage and Retrieval
Language*
Likelihood Functions
Names
Programming Languages*
Reproducibility of Results
Software

Grants and funding

The authors have no funding or support to report.