Large-scale application of named entity recognition to biomedicine and epidemiology

Shaina Raza; Deepak John Reji; Femi Shajan; Syed Raza Bashir

doi:10.1371/journal.pdig.0000152

PLOS Digit Health. 2022 Dec 7;1(12):e0000152. doi: 10.1371/journal.pdig.0000152. eCollection 2022 Dec.

Authors

Shaina Raza¹, Deepak John Reji², Femi Shajan², Syed Raza Bashir³

Affiliations

¹ Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada.
² Environmental Resources Management, Bangalore, India.
³ Toronto Metropolitan University, Toronto, Ontario, Canada.

Abstract

Background: Despite significant advances in biomedical named entity recognition methods, the clinical application of these systems still faces several challenges: (1) most methods are trained on a limited set of clinical entities; (2) these methods rely heavily on large amounts of data for both pre-training and prediction, making their use in production impractical; (3) they do not consider non-clinical entities that are also related to patients' health, such as social, economic, or demographic factors.

Methods: In this paper, we develop Bio-Epidemiology-NER (https://pypi.org/project/Bio-Epidemiology-NER/), an open-source Python package for detecting biomedical named entities in text. The approach uses a Transformer-based system trained on a dataset annotated with many named entity types (medical, clinical, biomedical, and epidemiological). It improves on previous efforts in three ways: (1) it recognizes many clinical entity types, such as medical risk factors, vital signs, drugs, and biological functions; (2) it is easily configurable, reusable, and can scale up for training and inference; (3) it also considers non-clinical factors (such as age, gender, race, and social history) that influence health outcomes. At a high level, the pipeline consists of four phases: pre-processing, data parsing, named entity recognition, and named entity enhancement.
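The phases named above can be pictured with a minimal, self-contained Python sketch. It is not the package's implementation or API: every function name, the `Entity` class, the toy lexicon (a stand-in for the trained Transformer model), and the entity labels are assumptions made for illustration, as is the reading of "enhancement" as simple de-duplication.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical toy lexicon standing in for the trained Transformer model.
# The labels are illustrative only, not the package's actual label set.
TOY_LEXICON = {
    "fever": "Sign_symptom",
    "cough": "Sign_symptom",
    "aspirin": "Medication",
    "female": "Sex",
}


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    sentence_index: int


def preprocess(raw: str) -> str:
    """Phase 1 (assumed): collapse runs of whitespace and trim the text."""
    return re.sub(r"\s+", " ", raw).strip()


def parse(text: str) -> List[str]:
    """Phase 2 (assumed): naive split into sentences after ., ! or ?."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def recognize(sentences: List[str]) -> List[Entity]:
    """Phase 3 (stand-in): tag alphabetic tokens found in the toy lexicon."""
    entities = []
    for index, sentence in enumerate(sentences):
        for token in re.findall(r"[A-Za-z]+", sentence):
            label = TOY_LEXICON.get(token.lower())
            if label is not None:
                entities.append(Entity(token.lower(), label, index))
    return entities


def enhance(entities: List[Entity]) -> List[Entity]:
    """Phase 4 (assumed): drop repeated mentions, keeping first-seen order."""
    seen = set()
    unique = []
    for entity in entities:
        key = (entity.text, entity.label)
        if key not in seen:
            seen.add(key)
            unique.append(entity)
    return unique


def run_pipeline(raw: str) -> List[Entity]:
    """Chain the four phases in the order the abstract lists them."""
    return enhance(recognize(parse(preprocess(raw))))


if __name__ == "__main__":
    note = "A 45-year-old female  presented with fever.\nShe had a cough and fever."
    for entity in run_pipeline(note):
        print(entity.label, entity.text)
```

Keeping each phase as a separate function mirrors the configurability the abstract claims: in this sketch, the lexicon lookup in `recognize` could be swapped for a real model without touching the other phases.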

Results: Experimental results show that our pipeline outperforms other methods on three benchmark datasets, with macro- and micro-averaged F1 scores of around 90 percent and above.
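For readers unfamiliar with the two averages, the following sketch shows how macro- and micro-averaged F1 differ when computed from per-entity-type counts. The counts are invented purely for illustration and are not results from the paper; the function names are likewise hypothetical.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) per entity type.
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_micro_f1(per_type: Dict[str, Counts]) -> Tuple[float, float]:
    """Return (macro F1, micro F1) over all entity types."""
    # Macro: average the per-type F1 scores, so every type weighs the same.
    macro = sum(f1(*counts) for counts in per_type.values()) / len(per_type)
    # Micro: pool the counts first, so frequent types dominate the score.
    total_tp = sum(counts[0] for counts in per_type.values())
    total_fp = sum(counts[1] for counts in per_type.values())
    total_fn = sum(counts[2] for counts in per_type.values())
    micro = f1(total_tp, total_fp, total_fn)
    return macro, micro


if __name__ == "__main__":
    # Made-up counts: one frequent, well-recognized type and one rare, weak type.
    toy_counts = {"Medication": (90, 10, 10), "Age": (5, 5, 5)}
    macro, micro = macro_micro_f1(toy_counts)
    print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

With these toy counts the macro average is pulled down by the rare type while the micro average stays closer to the frequent one, which is why reporting both gives a fuller picture across many entity types.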

Conclusion: This package is publicly available for researchers, doctors, clinicians, and anyone else who wants to extract biomedical named entities from unstructured biomedical texts.

Copyright: © 2022 Raza et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Grants and funding

The authors received no specific funding for this work.