Evaluating sentence representations for biomedical text: Methods and experimental results

Noha S Tawfik; Marco R Spruit

doi:10.1016/j.jbi.2020.103396

J Biomed Inform. 2020 Apr:104:103396. doi: 10.1016/j.jbi.2020.103396. Epub 2020 Mar 6.

Authors

Noha S Tawfik¹, Marco R Spruit²

Affiliations

¹ Computer Engineering Department, College of Engineering, Arab Academy for Science, Technology, and Maritime Transport (AAST), 1029 Alexandria, Egypt; Department of Information and Computing Sciences, Utrecht University, 3584 CC Utrecht, the Netherlands. Electronic address: noha.abdelsalam@aast.edu.
² Department of Information and Computing Sciences, Utrecht University, 3584 CC Utrecht, the Netherlands. Electronic address: m.r.spruit@uu.nl.

PMID: 32147441
DOI: 10.1016/j.jbi.2020.103396

Abstract

Text representations are one of the main inputs to various Natural Language Processing (NLP) methods. Given the fast developmental pace of new sentence embedding methods, we argue that there is a need for a unified methodology to assess these different techniques in the biomedical domain. This work introduces a comprehensive evaluation of novel methods across ten medical classification tasks. The tasks cover a variety of BioNLP problems such as semantic similarity, question answering, citation sentiment analysis and others, with binary and multi-class datasets. Our goal is to assess the transferability of different sentence representation schemes to the medical and clinical domain. Our analysis shows that embeddings based on Language Models, which account for the context-dependent nature of words, usually outperform the others. Nonetheless, no single embedding model represents biomedical and clinical texts with consistent performance across all tasks. This illustrates the need for a more suitable bio-encoder. Our MedSentEval source code, pre-trained embeddings and examples have been made available on GitHub.

Keywords: BioNLP; Language model; Sentence embeddings; Text representation.

MeSH terms

Language*
Natural Language Processing*
Semantics
Software