A clinical specific BERT developed using a huge Japanese clinical text corpus

PLoS One. 2021 Nov 9;16(11):e0259763. doi: 10.1371/journal.pone.0259763. eCollection 2021.

Abstract

Generalized language models pre-trained on a large corpus have achieved strong performance on natural language tasks. While many pre-trained transformers for English have been published, few models are available for Japanese text, especially in clinical medicine. In this work, we describe the development of a clinical specific BERT model pre-trained on a large volume of Japanese clinical text and evaluate it on the NTCIR-13 MedWeb task, which consists of simulated Twitter messages about medical concerns annotated with eight labels. Approximately 120 million clinical text records stored at the University of Tokyo Hospital were used as our dataset. A BERT-base model was pre-trained on the entire dataset with a vocabulary of 25,000 tokens. Pre-training was nearly saturated after about 4 epochs, and the accuracies of Masked-LM and Next Sentence Prediction were 0.773 and 0.975, respectively. The developed BERT did not show significantly higher performance on the MedWeb task than other BERT models pre-trained on Japanese Wikipedia text. The advantage of pre-training on clinical text may become apparent in more complex tasks on actual clinical text, and such an evaluation set needs to be developed.
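
The MedWeb evaluation described above is a standard multi-label fine-tuning setup. The sketch below illustrates, under stated assumptions, how a BERT-base checkpoint could be fine-tuned for an eight-label classification of this kind with the Hugging Face transformers library; it is not the authors' code, and the checkpoint name, hyperparameters, and example input are illustrative assumptions rather than values taken from the paper.

    # Minimal fine-tuning sketch (not the authors' code): an eight-label
    # multi-label classifier on MedWeb-style text with Hugging Face transformers.
    # Checkpoint name, hyperparameters, and example input are assumptions.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL_NAME = "cl-tohoku/bert-base-japanese"  # a public Japanese Wikipedia BERT,
                                                 # used only as a stand-in; its
                                                 # tokenizer requires fugashi/ipadic
    NUM_LABELS = 8  # the eight MedWeb symptom/concern labels

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=NUM_LABELS,
        problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
    )

    # One illustrative tweet-like example with hypothetical binary labels.
    texts = ["インフルエンザで熱が下がらない。"]
    labels = torch.tensor([[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])

    batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                      return_tensors="pt")
    outputs = model(**batch, labels=labels)  # loss is BCEWithLogitsLoss over 8 labels
    outputs.loss.backward()                  # an optimizer step would follow in training

    # At inference time, per-label probabilities come from a sigmoid over the logits.
    probs = torch.sigmoid(outputs.logits)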

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Clinical Medicine
  • Electric Power Supplies
  • Japan
  • Language*
  • Text Messaging

Grants and funding

This project was partly funded by the Japan Science and Technology Agency, Promoting Individual Research to Nurture the Seeds of Future Innovation and Organizing Unique, Innovative Network (PRESTO, grant JPMJPR1654). There were no other funders. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.