Training a Deep Contextualized Language Model for International Classification of Diseases, 10th Revision Classification via Federated Learning: Model Development and Validation Study

Pei-Fu Chen; Tai-Liang He; Sheng-Che Lin; Yuan-Chia Chu; Chen-Tsung Kuo; Feipei Lai; Ssu-Ming Wang; Wan-Xuan Zhu; Kuan-Chih Chen; Lu-Cheng Kuo; Fang-Ming Hung; Yu-Cheng Lin; I-Chang Tsai; Chi-Hao Chiu; Shu-Chih Chang; Chi-Yu Yang

doi:10.2196/41342

JMIR Med Inform. 2022 Nov 10;10(11):e41342. doi: 10.2196/41342.

Authors

Pei-Fu Chen^#¹,²; Tai-Liang He^#³; Sheng-Che Lin³; Yuan-Chia Chu⁴,⁵,⁶; Chen-Tsung Kuo⁴,⁵,⁶; Feipei Lai¹,³,⁷; Ssu-Ming Wang¹; Wan-Xuan Zhu⁸; Kuan-Chih Chen¹,⁹; Lu-Cheng Kuo¹⁰; Fang-Ming Hung¹¹,¹²; Yu-Cheng Lin¹³,¹⁴; I-Chang Tsai¹⁵; Chi-Hao Chiu¹⁶; Shu-Chih Chang¹⁷; Chi-Yu Yang¹⁸,¹⁹

Affiliations

¹ Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan.
² Department of Anesthesiology, Far Eastern Memorial Hospital, New Taipei City, Taiwan.
³ Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.
⁴ Department of Information Management, Taipei Veterans General Hospital, Taipei City, Taiwan.
⁵ Medical Artificial Intelligence Development Center, Taipei Veterans General Hospital, Taipei City, Taiwan.
⁶ Department of Information Management, National Taipei University of Nursing and Health Sciences, Taipei City, Taiwan.
⁷ Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan.
⁸ Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan.
⁹ Department of Internal Medicine, Far Eastern Memorial Hospital, New Taipei City, Taiwan.
¹⁰ Department of Internal Medicine, National Taiwan University Hospital, National Taiwan University College of Medicine, Taipei, Taiwan.
¹¹ Department of Medical Affairs, Far Eastern Memorial Hospital, New Taipei City, Taiwan.
¹² Department of Surgical Intensive Care Unit, Far Eastern Memorial Hospital, New Taipei City, Taiwan.
¹³ Department of Pediatrics, Far Eastern Memorial Hospital, New Taipei City, Taiwan.
¹⁴ School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan.
¹⁵ Artificial Intelligence Center, Far Eastern Memorial Hospital, New Taipei City, Taiwan.
¹⁶ Section of Health Insurance, Department of Medical Affairs, Far Eastern Memorial Hospital, New Taipei City, Taiwan.
¹⁷ Medical Records Department, Far Eastern Memorial Hospital, New Taipei City, Taiwan.
¹⁸ Department of Information Technology, Far Eastern Memorial Hospital, New Taipei City, Taiwan.
¹⁹ Section of Cardiovascular Medicine, Cardiovascular Center, Far Eastern Memorial Hospital, New Taipei City, Taiwan.

^# Contributed equally.

PMID: 36355417
PMCID: PMC9693720
DOI: 10.2196/41342

Abstract

Background: The automatic coding of clinical text documents by using the International Classification of Diseases, 10th Revision (ICD-10) can be performed for statistical analyses and reimbursements. With the development of natural language processing models, new transformer architectures with attention mechanisms have outperformed previous models. Although multicenter training may increase a model's performance and external validity, the privacy of clinical documents should be protected. We used federated learning to train a model with multicenter data, without sharing data per se.

Objective: This study aims to train a classification model via federated learning for ICD-10 multilabel classification.

Methods: Text data from discharge notes in electronic medical records were collected from the following three medical centers: Far Eastern Memorial Hospital, National Taiwan University Hospital, and Taipei Veterans General Hospital. After comparing the performance of different variants of bidirectional encoder representations from transformers (BERT), PubMedBERT was chosen for the word embeddings. With regard to preprocessing, the nonalphanumeric characters were retained because the model's performance decreased after the removal of these characters. To explain the outputs of our model, we added a label attention mechanism to the model architecture. The model was trained with data from each of the three hospitals separately and via federated learning. The models trained via federated learning and the models trained with local data were compared on a testing set that was composed of data from the three hospitals. The micro F₁ score was used to evaluate model performance across all 3 centers.
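
The abstract does not state which aggregation rule was used for federated learning. As a minimal sketch, assuming the common federated averaging (FedAvg) scheme, each hospital would train on its own notes and send only model parameters to a server, which combines them weighted by local sample count. The function name, parameter name, and all numbers below are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# A model is represented here as a mapping from parameter name to a flat list of floats.
Weights = Dict[str, List[float]]


def federated_average(client_weights: List[Weights], client_sizes: List[int]) -> Weights:
    """Return the sample-size-weighted average of the clients' parameters (FedAvg-style)."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client and at least one client")
    total = sum(client_sizes)
    averaged: Weights = {}
    for name in client_weights[0]:
        length = len(client_weights[0][name])
        averaged[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(length)
        ]
    return averaged


# Hypothetical toy updates from three hospitals; only parameters leave each site,
# never the discharge notes themselves.
hospital_updates = [
    {"classifier.bias": [0.0, 1.0]},
    {"classifier.bias": [1.0, 1.0]},
    {"classifier.bias": [2.0, 4.0]},
]
hospital_sizes = [100, 100, 200]  # assumed local training-set sizes

global_weights = federated_average(hospital_updates, hospital_sizes)
print(global_weights)
```

In a full training loop this aggregation step would repeat every round: the server sends the averaged parameters back to the hospitals, which continue training locally.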

Results: The F₁ scores of PubMedBERT, RoBERTa (Robustly Optimized BERT Pretraining Approach), ClinicalBERT, and BioBERT (BERT for Biomedical Text Mining) were 0.735, 0.692, 0.711, and 0.721, respectively. The F₁ score of the model that retained nonalphanumeric characters was 0.8120, whereas the F₁ score after removing these characters was 0.7875, a decrease of 0.0245 (3.11%). The F₁ scores on the testing set were 0.6142, 0.4472, 0.5353, and 0.2522 for the federated learning, Far Eastern Memorial Hospital, National Taiwan University Hospital, and Taipei Veterans General Hospital models, respectively. The explainable predictions were displayed with highlighted input words via the label attention architecture.
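
For readers unfamiliar with the metric, the micro F₁ score for multilabel classification pools true positives, false positives, and false negatives over every document and every code before computing a single score. The sketch below illustrates that definition only; the ICD-10 codes and predictions are hypothetical and are not taken from the study data.

```python
from typing import List, Set


def micro_f1(true_labels: List[Set[str]], predicted_labels: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP, FP, and FN over all documents and all labels."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)   # codes correctly predicted
        fp += len(pred - truth)   # codes predicted but not assigned by coders
        fn += len(truth - pred)   # codes assigned by coders but missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical gold-standard and predicted ICD-10 code sets for two discharge notes.
gold = [{"I10", "E11.9"}, {"J18.9"}]
predicted = [{"I10"}, {"J18.9", "N39.0"}]

score = micro_f1(gold, predicted)
print(round(score, 4))
```

Because counts are pooled before the score is computed, frequent codes influence the micro F₁ score more than rare ones.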

Conclusions: Federated learning was used to train the ICD-10 classification model on multicenter clinical text while protecting data privacy. On the combined testing set, the federated model achieved a higher F₁ score (0.6142) than any of the models trained locally at a single hospital (0.2522 to 0.5353).

Keywords: International Classification of Diseases; federated learning; machine learning; multilabel text classification; natural language processing.

©Pei-Fu Chen, Tai-Liang He, Sheng-Che Lin, Yuan-Chia Chu, Chen-Tsung Kuo, Feipei Lai, Ssu-Ming Wang, Wan-Xuan Zhu, Kuan-Chih Chen, Lu-Cheng Kuo, Fang-Ming Hung, Yu-Cheng Lin, I-Chang Tsai, Chi-Hao Chiu, Shu-Chih Chang, Chi-Yu Yang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 10.11.2022.