Development and benchmarking of a Korean audio speech recognition model for Clinician-Patient conversations in radiation oncology clinics

Int J Med Inform. 2023 Aug;176:105112. doi: 10.1016/j.ijmedinf.2023.105112. Epub 2023 Jun 1.

Abstract

Background: The purpose of this study was to develop an audio speech recognition (ASR) deep learning model for transcribing clinician-patient conversations in radiation oncology clinics.

Methods: We fine-tuned the pre-trained English QuartzNet 15x5 model for the Korean language using a publicly available dataset of simulated conversations between clinicians and patients. Subsequently, real conversations between a radiation oncologist and 115 patients in actual clinics were prospectively collected, transcribed, and divided into training (30.26 h) and testing (0.79 h) sets. These datasets were used to develop the ASR model for clinics, which was benchmarked against other ASR models, including 'Whisper large,' the 'Riva Citrinet-1024 Korean model,' and the 'Riva Conformer Korean model.'
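
As an illustration only (the abstract includes no code, and the manifest paths, vocabulary, and hyperparameters below are our assumptions), fine-tuning a pre-trained English QuartzNet 15x5 checkpoint for Korean with the NVIDIA NeMo toolkit might look roughly like this sketch:

```python
# Minimal fine-tuning sketch, assuming the NVIDIA NeMo toolkit with
# placeholder manifest paths, vocabulary, and hyperparameters; this is
# not the authors' code.
import pytorch_lightning as pl
from omegaconf import DictConfig
import nemo.collections.asr as nemo_asr

# Load the pre-trained English QuartzNet 15x5 checkpoint.
model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

# Swap the English character set for a Korean one (hypothetical,
# truncated list; a real vocabulary would cover the Hangul characters
# appearing in the training transcripts).
korean_vocab = [" ", "가", "나", "다", "라"]
model.change_vocabulary(new_vocabulary=korean_vocab)

# NeMo-style JSON-lines manifests of {"audio_filepath", "duration", "text"}.
train_cfg = DictConfig({
    "manifest_filepath": "train_manifest.json",  # placeholder path
    "sample_rate": 16000,
    "labels": korean_vocab,
    "batch_size": 32,
    "shuffle": True,
})
val_cfg = DictConfig({
    "manifest_filepath": "val_manifest.json",  # placeholder path
    "sample_rate": 16000,
    "labels": korean_vocab,
    "batch_size": 32,
    "shuffle": False,
})
model.setup_training_data(train_data_config=train_cfg)
model.setup_validation_data(val_data_config=val_cfg)

# Fine-tune end-to-end with the model's CTC loss.
trainer = pl.Trainer(max_epochs=50, accelerator="gpu", devices=1)
trainer.fit(model)
```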

Results: The pre-trained English ASR model was successfully fine-tuned and adapted to recognize the Korean language, achieving a character error rate (CER) of 0.17 on the simulated dataset. However, this performance was not sustained on the real conversation dataset. To address this, we further fine-tuned the model on the clinic recordings, improving the CER to 0.26 on the real conversation test set. On this test set, the benchmarked models 'Whisper large,' 'Riva Citrinet-1024 Korean model,' and 'Riva Conformer Korean model' showed CERs of 0.31, 0.28, and 0.25, respectively. On the general Korean conversation dataset 'zeroth-korean,' our model showed a CER of 0.44, while 'Whisper large,' the 'Riva Citrinet-1024 Korean model,' and the 'Riva Conformer Korean model' showed CERs of 0.26, 0.98, and 0.99, respectively.
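
For reference, the CER values above are character-level edit distances: CER = (S + D + I) / N, where S, D, and I are the character substitutions, deletions, and insertions against the reference transcript, and N is the number of reference characters. A minimal sketch of this computation using the jiwer library (our assumption; the paper does not state which scoring implementation was used):

```python
# Minimal CER sketch using the jiwer library (an assumption; the study
# does not specify its scoring implementation).
import jiwer

reference = "치료 후 피부가 붉어졌어요"   # hypothetical reference transcript
hypothesis = "치료 후 피부가 불거졌어요"  # hypothetical ASR output

# jiwer.cer returns (substitutions + deletions + insertions) divided by
# the number of characters in the reference.
cer = jiwer.cer(reference, hypothesis)
print(f"CER: {cer:.2f}")
```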

Conclusion: We developed a Korean ASR model that transcribes real conversations between a radiation oncologist and patients. Compared with the benchmarked models, its performance was acceptable for both the clinic-specific and general-purpose datasets. We anticipate that this model will reduce the time clinicians spend documenting patients' chief complaints and side effects.

Keywords: Audio speech recognition; Deep learning; Electronic health record.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Benchmarking
  • Humans
  • Language
  • Radiation Oncology*
  • Republic of Korea
  • Speech Perception*
  • Speech Recognition Software