Measuring the effect of different types of unsupervised word representations on Medical Named Entity Recognition

Arantza Casillas; Nerea Ezeiza; Iakes Goenaga; Alicia Pérez; Xabier Soto

doi:10.1016/j.ijmedinf.2019.05.022

Int J Med Inform. 2019 Sep:129:100-106. doi: 10.1016/j.ijmedinf.2019.05.022. Epub 2019 Jun 5.

Authors

Arantza Casillas¹, Nerea Ezeiza², Iakes Goenaga³, Alicia Pérez⁴, Xabier Soto⁵

Affiliations

¹ IXA Group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain. Electronic address: arantza.casillas@ehu.eus.
² IXA Group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain. Electronic address: n.ezeiza@ehu.eus.
³ IXA Group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain. Electronic address: iakes.goenaga@ehu.eus.
⁴ IXA Group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain. Electronic address: alicia.perez@ehu.eus.
⁵ IXA Group, University of the Basque Country (UPV-EHU), Manuel Lardizabal 1, 20080 Donostia, Spain. Electronic address: xabier.soto@ehu.eus.

PMID: 31445243
DOI: 10.1016/j.ijmedinf.2019.05.022

Abstract

Background: This work deals with Natural Language Processing applied to the clinical domain, specifically with Medical Entity Recognition (MER) on Electronic Health Records (EHRs). Until Deep Neural Networks (DNNs) emerged, developing a MER system entailed heavy data preprocessing and feature engineering. However, the quality of the word representations used in the embedding layers remains an important issue for inference with DNNs.

Goal: The main goal of this work is to develop a robust MER system by adapting general-purpose DNNs to cope with the high lexical variability shown in EHRs. In addition, given that EHRs tend to be scarce while out-of-domain corpora are available, the aim is to assess the impact of the word representations on the performance of the MER system as we move to other domains. To this end, exhaustive experimentation that varies the information generation methods and network parameters is crucial.

Methods: We adapted a general-purpose sequential tagger based on Bidirectional Long Short-Term Memory (BiLSTM) cells and Conditional Random Fields (CRFs) in order to make it tolerant to high lexical variability and to a limited amount of training data. To this end, we added part-of-speech (POS) and semantic-tag embedding layers to the word representations.
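One way to picture the enriched input described in Methods is as a per-token concatenation of three vectors: word, POS, and semantic tag. The sketch below is only an illustration under stated assumptions. The vocabularies, tag names, dimensions, and the helper names make_embedding_table and token_representation are all hypothetical, the vectors are random stand-ins for learned embeddings, and the BiLSTM and CRF layers that would consume these inputs are not shown.

```python
import random

def make_embedding_table(vocab, dim, seed):
    """Map each symbol to a small random vector (stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {sym: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for sym in vocab}

def token_representation(word, pos, sem, word_emb, pos_emb, sem_emb):
    """Concatenate the word, POS, and semantic-tag vectors for one token."""
    word_vec = word_emb.get(word.lower(), word_emb["<unk>"])  # unseen words fall back to <unk>
    return word_vec + pos_emb[pos] + sem_emb[sem]  # list concatenation, not addition

# Hypothetical toy inventories and dimensions, chosen only for illustration.
WORD_DIM, POS_DIM, SEM_DIM = 8, 3, 2
word_emb = make_embedding_table(["<unk>", "patient", "takes", "ibuprofen"], WORD_DIM, seed=0)
pos_emb = make_embedding_table(["NOUN", "VERB"], POS_DIM, seed=1)
sem_emb = make_embedding_table(["DRUG", "NONE"], SEM_DIM, seed=2)

# "ibuprofeno" is deliberately absent from the word vocabulary.
sentence = [("Patient", "NOUN", "NONE"), ("takes", "VERB", "NONE"), ("ibuprofeno", "NOUN", "DRUG")]
inputs = [token_representation(w, p, s, word_emb, pos_emb, sem_emb) for w, p, s in sentence]
```

In this toy setup the unseen spelling "ibuprofeno" maps to the unknown-word vector, yet its input still carries the POS and semantic-tag components, which is one plausible reading of how the extra layers help with lexical variability.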

Results: One of the strengths of this work is the exhaustive evaluation of dense word representations obtained by varying not only the domain and genre but also the learning algorithms and their parameter settings. With the proposed method, we attained an error reduction of 1.71 (5.7%) compared to the state-of-the-art, even though no preprocessing or feature engineering was used.
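The two figures in the error reduction can be related with simple arithmetic, assuming that 1.71 is an absolute reduction in error (in points) and that 5.7% is the same reduction relative to the baseline error. The abstract does not state the metric or the baseline value, so the baseline computed below is back-derived under that assumption and is not a reported result. The function names are hypothetical.

```python
def implied_baseline_error(absolute_reduction, relative_reduction):
    """Baseline error implied by an absolute reduction and its relative share."""
    return absolute_reduction / relative_reduction

def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline error removed by the new system."""
    return (baseline_error - new_error) / baseline_error

# Assumption: 5.7% is measured relative to the baseline error.
baseline = implied_baseline_error(1.71, 0.057)   # about 30 points of error
new_error = baseline - 1.71                       # about 28.29 points of error
check = relative_error_reduction(baseline, new_error)  # recovers about 0.057
```

Under this reading, the state-of-the-art baseline would have an error of roughly 30 points, which the proposed method lowers to roughly 28.3.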

Conclusions: Our results indicate that dense representations built taking word order into account benefit the entity extraction system. In addition, we found that using a medical corpus (not necessarily EHRs) to infer the representations improves performance, even if the corpus does not belong to the same genre.

Keywords: Electronic Health Records; Health Information Systems; Medical Named Entity Recognition; Neural network.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Electronic Health Records
Natural Language Processing*
Neural Networks, Computer
Semantics
Subject Headings