Aligning words in French-English non-parallel medical texts: effect of term frequency distributions

Stud Health Technol Inform. 2004;107(Pt 1):23-7.

Abstract

In this paper, we present a method for aligning words based on a statistical model of word distribution similarity. The method rests on the assumption that patterns of word co-occurrence are correlated across texts in different languages. Using pages automatically downloaded from different medical web sites and a combined bilingual lexicon of general and medical terms as language sources, we assign a similarity score to each candidate translation pair, based on the distributional contexts of the two words. We vary several parameters of the method. Experimental results confirm a positive effect of frequency, show that medical words are handled better than less specialized words, and show no clear influence of context window size. Future directions for improvement include working with very large, part-of-speech tagged corpora.
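
The general idea of scoring a candidate translation pair by its distributional contexts can be sketched as follows. This is only an illustration under stated assumptions, not the implementation evaluated in the paper: the abstract does not name the weighting scheme or the similarity measure, so raw co-occurrence counts and cosine similarity are used here as stand-ins, and all function names (`context_vectors`, `translate_vector`, `cosine`, `rank_candidates`) are hypothetical.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tokens, window=2):
    """For each word, count the words occurring within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors


def translate_vector(vector, lexicon):
    """Map a source-language context vector into the target language.

    `lexicon` maps a source word to a list of target words (assumed format).
    Context words absent from the lexicon are dropped.
    """
    translated = Counter()
    for word, count in vector.items():
        for target in lexicon.get(word, ()):
            translated[target] += count
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_word, src_vectors, tgt_vectors, lexicon):
    """Return (score, target_word) pairs sorted from most to least similar."""
    query = translate_vector(src_vectors.get(source_word, Counter()), lexicon)
    scores = [(cosine(query, vec), tgt) for tgt, vec in tgt_vectors.items()]
    return sorted(scores, reverse=True)
```

In this sketch, `context_vectors` would be run once on the tokenized French pages and once on the English pages, and `rank_candidates` would then score every English word against the translated context of a French word. The `window` argument corresponds to the context window size that the abstract reports varying.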

MeSH terms

  • Algorithms
  • Language*
  • Multilingualism
  • Natural Language Processing*
  • Terminology as Topic*
  • Translating*
  • Vocabulary, Controlled