PharmBERT: a domain-specific BERT model for drug labels

Brief Bioinform. 2023 Jul 20;24(4):bbad226. doi: 10.1093/bib/bbad226.

Abstract

Human prescription drug labeling contains a summary of the essential scientific information needed for the safe and effective use of a drug and includes the Prescribing Information, FDA-approved patient labeling (Medication Guides, Patient Package Inserts and/or Instructions for Use) and/or carton and container labeling. Drug labeling contains critical information about drug products, such as pharmacokinetics and adverse events. Automatic information extraction from drug labels can facilitate tasks such as identifying a drug's adverse reactions or its interactions with other drugs. Natural language processing (NLP) techniques, especially the recently developed Bidirectional Encoder Representations from Transformers (BERT), have shown exceptional merit in text-based information extraction. A common paradigm for training BERT is to pretrain the model on large unlabeled generic-language corpora, so that it learns the distribution of words in the language, and then fine-tune it on a downstream task. In this paper, we first show that the language used in drug labels is distinctive and therefore cannot be optimally handled by other BERT models. We then present PharmBERT, a BERT model pretrained specifically on drug labels (publicly available on Hugging Face). We demonstrate that our model outperforms vanilla BERT, ClinicalBERT and BioBERT on multiple NLP tasks in the drug label domain. Finally, by analyzing the different layers of PharmBERT, we show how domain-specific pretraining contributes to its superior performance and gain insight into how it understands different linguistic aspects of the data.
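The pretrain-then-fine-tune paradigm described above rests on a masked-language-modeling (MLM) objective: a fraction of input tokens is hidden and the model learns to predict them from context. Below is a minimal sketch of BERT-style masking, using the standard BERT settings (15% of tokens selected; of those, 80% replaced by [MASK], 10% by a random token, 10% left unchanged). These proportions are from the original BERT recipe, not from the PharmBERT paper itself:

```python
import random

def mlm_mask(tokens, vocab, mask_prob=0.15, rng=None):
    """BERT-style masking: select ~mask_prob of positions as targets.
    Of those, 80% -> [MASK], 10% -> random vocab token, 10% -> unchanged.
    Returns (masked_tokens, labels), where labels[i] holds the original
    token at target positions and None elsewhere (ignored by the loss)."""
    rng = rng or random.Random(0)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)            # model must predict the original
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")   # 80%: replace with mask token
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)        # 10%: kept, but still predicted
        else:
            labels.append(None)           # not a target; loss skips it
            masked.append(tok)
    return masked, labels

# Example: mask a sentence in the drug-label register.
sentence = "the pharmacokinetics of the drug were linear over the dose range".split()
masked, labels = mlm_mask(sentence, vocab=sentence, rng=random.Random(42))
```

Domain-specific pretraining such as PharmBERT's simply runs this objective over drug-label text instead of generic corpora, so the learned distributions reflect pharmacological vocabulary before any fine-tuning.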

Keywords: BERT; drug label; natural language processing; pretraining.

Publication types

  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • Drug Labeling*
  • Humans
  • Information Storage and Retrieval*
  • Learning
  • Natural Language Processing