Comparison of radiologist versus natural language processing-based image annotations for deep learning system for tuberculosis screening on chest radiographs

Clin Imaging. 2022 Jul;87:34-37. doi: 10.1016/j.clinimag.2022.04.009. Epub 2022 Apr 25.

Abstract

Although natural language processing (NLP) can rapidly extract disease labels from radiology reports to create datasets for deep learning models, this may be less accurate than having radiologists manually review the images. In this study, we compared agreement between NLP-derived and radiologist-curated labels for possible tuberculosis (TB) on chest radiographs (CXRs) and evaluated the performance of deep convolutional neural networks (DCNNs) trained to identify TB using each of the two label sets. We collected 10,951 CXRs from the NIH ChestX-ray14 dataset and labeled them as positive or negative for possible TB using two methods: 1) NLP-derived disease labels and 2) radiologist review of the images. These images were used to train DCNNs for possible TB on varying dataset sizes, and the trained models were tested on an external dataset of 800 CXRs. Area under the ROC curve (AUC) was used to evaluate the DCNNs. There was poor agreement between NLP-derived and radiologist-curated labels for possible TB (kappa coefficient 0.34). DCNNs trained using radiologist-curated labels outperformed those trained using NLP-derived labels, regardless of the number of images used for training. The best-performing DCNN, trained on all 10,951 images with radiologist-curated labels, achieved an AUC of 0.88. DCNNs trained on CXRs labeled by a radiologist consistently outperformed those trained on the same CXRs labeled by NLP, highlighting the benefit of having radiologists determine the ground truth for machine learning dataset curation.
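As a minimal sketch of the two evaluation steps described above (label agreement via Cohen's kappa and model performance via ROC AUC), and not the authors' actual code, the metrics could be computed with scikit-learn as below; the label arrays and predicted probabilities are hypothetical placeholders.

```python
# Minimal sketch (assumed workflow, not the authors' code): compare NLP-derived
# labels with radiologist-curated labels using Cohen's kappa, then evaluate a
# trained model's TB predictions on a held-out test set with ROC AUC.
import numpy as np
from sklearn.metrics import cohen_kappa_score, roc_auc_score

# Hypothetical binary labels for the same CXRs (1 = possible TB, 0 = negative).
nlp_labels = np.array([1, 0, 0, 1, 0, 1, 0, 0])          # from report text via NLP
radiologist_labels = np.array([1, 0, 1, 1, 0, 0, 0, 0])  # from image review

# Inter-method agreement; the study reported a kappa of 0.34 (poor agreement).
kappa = cohen_kappa_score(nlp_labels, radiologist_labels)
print(f"Cohen's kappa: {kappa:.2f}")

# Hypothetical external-test evaluation: ground-truth labels and the DCNN's
# predicted probabilities of possible TB for each test radiograph.
test_labels = np.array([1, 0, 1, 0, 1, 0])
predicted_probs = np.array([0.91, 0.20, 0.75, 0.35, 0.60, 0.10])
auc = roc_auc_score(test_labels, predicted_probs)
print(f"AUC: {auc:.2f}")
```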

Keywords: Artificial intelligence; Chest radiographs; Deep learning; Natural language processing; Tuberculosis.

MeSH terms

  • Data Curation
  • Deep Learning*
  • Humans
  • Natural Language Processing
  • Radiography, Thoracic / methods
  • Radiologists
  • Retrospective Studies