Analysis of Language Embeddings for Classification of Unstructured Pathology Reports

Annu Int Conf IEEE Eng Med Biol Soc. 2021 Nov;2021:2378-2381. doi: 10.1109/EMBC46164.2021.9630347.

Abstract

A pathology report is one of the most significant medical documents, providing interpretive insights into the visual appearance of a patient's biopsy sample. In digital pathology, high-resolution images of tissue samples are stored along with pathology reports. Despite the valuable information that pathology reports hold, they are not used in any systematic manner to advance computational pathology. In this work, we focus on analyzing the reports, which are generally unstructured documents written in English with sophisticated and highly specialized medical terminology. We provide a comparative analysis of embeddings from pre-trained language models (BioBERT, Clinical BioBERT, and BioMed-RoBERTa), of Term Frequency-Inverse Document Frequency (TF-IDF), a traditional NLP technique, and of combinations of the pre-trained embeddings with TF-IDF. Our results demonstrate the effectiveness of various word embedding techniques for classifying pathology reports.
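The combination of TF-IDF with pre-trained embeddings mentioned above can be pictured as feature concatenation. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say how the two representations were combined, so concatenation is assumed here. The two report strings are made-up toy examples, and `placeholder_dense_embedding` is a hypothetical stand-in (a hashed bag of words) for a vector that would in practice come from a model such as BioBERT.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def tfidf_vectors(docs):
    """Return (vocab, vectors): one L2-normalised TF-IDF vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append(l2_normalize([counts[t] * idf[t] for t in vocab]))
    return vocab, vectors


def placeholder_dense_embedding(text, dim=8):
    """ASSUMPTION: hypothetical stand-in for a pre-trained model embedding.

    A real pipeline would return a fixed-size vector from a language model;
    here each token is hashed into one of `dim` buckets so the example runs
    without external dependencies.
    """
    vec = [0.0] * dim
    for tok in tokenize(text):
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    return l2_normalize(vec)


def combined_features(docs, dim=8):
    """Concatenate each document's TF-IDF vector with its dense vector."""
    _, sparse = tfidf_vectors(docs)
    return [s + placeholder_dense_embedding(d, dim) for s, d in zip(sparse, docs)]


# Made-up toy reports, for illustration only.
reports = [
    "Invasive ductal carcinoma, grade 2, margins negative.",
    "Benign fibroadenoma, no evidence of malignancy.",
]
features = combined_features(reports)
```

Each row of `features` could then be passed to any standard classifier. Normalising both parts before concatenation keeps either representation from dominating purely because of scale; that is a design choice of this sketch, not something stated in the abstract.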

MeSH terms

  • Humans
  • Language*
  • Natural Language Processing*
  • Writing