PathologyBERT - Pre-trained Vs. A New Transformer Language Model for Pathology Domain

AMIA Annu Symp Proc. 2023 Apr 29:2022:962-971. eCollection 2022.

Abstract

Pathology text mining is a challenging task given the reporting variability and constant new findings in cancer sub-type definitions. However, successful text mining of a large pathology database can play a critical role in advancing 'big data' cancer research such as similarity-based treatment selection, case identification, prognostication, surveillance, clinical trial screening, and risk stratification, among others. While there is growing interest in developing language models for more specific clinical domains, no pathology-specific language model exists to support rapid data-mining development in the pathology space. In the literature, a few approaches have fine-tuned general transformer models on specialized corpora while maintaining the original tokenizer, but in fields requiring specialized terminology these models often fail to perform adequately. We propose PathologyBERT - a pre-trained masked language model trained on 347,173 histopathology specimen reports and publicly released in the Hugging Face repository. Our comprehensive experiments demonstrate that pre-training a transformer model on pathology corpora yields performance improvements on Natural Language Understanding (NLU) and Breast Cancer Diagnosis Classification when compared to nonspecific language models.
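
As a sketch of how such a released masked language model could be queried, the snippet below loads a checkpoint from the Hugging Face Hub with the transformers library and runs a fill-mask query on a pathology-style sentence. The model identifier tsantos/PathologyBERT is an assumption and should be verified against the actual repository; the input sentence is an illustrative example, not taken from the paper.

```python
# Minimal sketch: loading a BERT-style masked language model from the
# Hugging Face Hub and querying it with a fill-mask pipeline.
# The model ID below is an assumption -- check the released repository.
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

MODEL_ID = "tsantos/PathologyBERT"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Illustrative pathology-style sentence with one masked token;
# the pipeline returns the top candidate tokens with their scores.
for prediction in fill_mask("intraductal papilloma with [MASK] type usual ductal hyperplasia"):
    print(f"{prediction['token_str']:>15}  score={prediction['score']:.3f}")
```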

MeSH terms

  • Big Data
  • Breast Neoplasms*
  • Data Mining
  • Female
  • Humans
  • Language
  • Natural Language Processing*