A real-world evaluation of the diagnostic accuracy of radiologists using positive predictive values verified from deep learning and natural language processing chest algorithms deployed retrospectively

Bahadar S Bhatia; John F Morlese; Sarah Yusuf; Yiting Xie; Bob Schallhorn; David Gruen

doi:10.1093/bjro/tzad009

BJR Open. 2023 Dec 12;6(1):tzad009. doi: 10.1093/bjro/tzad009. eCollection 2024 Jan.

Authors

Bahadar S Bhatia¹ ², John F Morlese¹, Sarah Yusuf¹, Yiting Xie³, Bob Schallhorn³, David Gruen⁴

Affiliations

¹ Directorate of Diagnostic Radiology, Sandwell & West Birmingham NHS Trust, Lyndon, West Bromwich B71 4HJ, United Kingdom.
² Space Research Centre, Physics & Astronomy, University of Leicester, 92 Corporation Road, Leicester LE4 5SP, United Kingdom.
³ Merge, Merative (Formerly, IBM Watson Health Imaging), Ann Arbor, Michigan, MI 48108, United States.
⁴ Jefferson Radiology and Radiology Partners, 111 Founders Plaza, East Hartford, Connecticut CT 06108, United States.

Abstract

Objectives: This diagnostic study retrospectively assessed the accuracy of radiologists, using the deep learning and natural language processing chest algorithms implemented in Clinical Review version 3.2 for pneumothorax and rib fractures in digital chest X-ray radiographs (CXR), and for aortic aneurysm, pulmonary nodules, emphysema, and pulmonary embolism in CT images.

Methods: The study design was double-blind (artificial intelligence [AI] algorithms and humans), retrospective, and non-interventional, and the study was conducted at a single NHS Trust. Adult patients (≥18 years old) scheduled for CXR and CT were invited to enroll as participants through an opt-out process. Reports and images were de-identified and processed retrospectively, and AI-flagged discrepant findings were assigned to two lead radiologists, each blinded to patient identifiers and to the original radiologist. The radiologist's findings for each clinical condition were tallied as a verified discrepancy (true positive) or not (false positive).
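The tallying step described above maps directly onto the usual definition of positive predictive value, PPV = TP / (TP + FP), computed per condition over the AI-flagged discrepancies. The following is a minimal sketch of that arithmetic, not the study's actual pipeline: the function name `tally_ppv`, the `(condition, verified)` record layout, and the example records are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def tally_ppv(reviews):
    """Return {condition: PPV} from (condition, verified) review outcomes.

    Each item stands for one AI-flagged discrepancy after review:
    verified=True counts as a true positive, verified=False as a
    false positive. PPV = TP / (TP + FP) for each condition.
    """
    true_pos = Counter()
    flagged = Counter()
    for condition, verified in reviews:
        flagged[condition] += 1
        if verified:
            true_pos[condition] += 1
    # Every condition in `flagged` has at least one record, so no division by zero.
    return {c: true_pos[c] / flagged[c] for c in sorted(flagged)}


# Hypothetical review outcomes, for illustration only (not study data).
example_reviews = [
    ("emphysema", True),
    ("emphysema", False),
    ("rib_fracture", False),
    ("rib_fracture", False),
]
example_ppv = tally_ppv(example_reviews)
print(example_ppv)
```

With these made-up records the sketch reports a PPV of 0.5 for emphysema and 0.0 for rib fractures. A condition with no flagged studies does not appear in the result at all, which is distinct from a PPV of zero.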

Results: The rates of missed findings were: 0.02% rib fractures, 0.51% aortic aneurysm, 0.32% pulmonary nodules, 0.92% emphysema, and 0.28% pulmonary embolism. The positive predictive values (PPVs) were: pneumothorax (0%), rib fractures (5.6%), aortic dilatation (43.2%), pulmonary emphysema (46.0%), pulmonary embolus (11.5%), and pulmonary nodules (9.2%). The PPV for pneumothorax was nil owing to a lack of available studies that were analysed for outpatient activity.

Conclusions: The number of missed findings was far lower than generally predicted. The chest algorithms deployed retrospectively were a useful quality tool, and AI augmented the radiologists' workflow.

Advances in knowledge: Our radiologists' rates of missed findings were 0.02% for rib fractures on CXR and, on CT studies, 0.51% for aortic dilatation, 0.32% for pulmonary nodule, 0.92% for pulmonary emphysema, and 0.28% for pulmonary embolism, all retrospectively evaluated with AI used as a quality tool to flag potential missed findings. It is important to account for the prevalence of these chest conditions in clinical context and to use appropriate clinical thresholds for decision-making, rather than relying solely on AI.
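The point about prevalence can be made concrete with the standard Bayes relation between PPV, sensitivity, specificity, and prevalence. The sketch below is a generic illustration of that relation only: the function name `ppv_from_prevalence` and the sensitivity and specificity values are assumptions for the example, and none of them are figures reported by this study.

```python
def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """PPV via Bayes' rule.

    PPV = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    All three arguments are proportions in the range [0, 1].
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0:
        raise ValueError("PPV is undefined when no positives are expected")
    return true_pos / (true_pos + false_pos)


# Illustrative values only (assumed, not taken from the study):
# the same hypothetical algorithm applied at three different prevalences.
for prevalence in (0.5, 0.1, 0.01):
    ppv = ppv_from_prevalence(0.9, 0.9, prevalence)
    print(f"prevalence={prevalence:.2f} -> PPV={ppv:.3f}")
```

For a fixed sensitivity and specificity, PPV falls as prevalence falls, so a low PPV for a rarely missed finding does not by itself indicate a poorly performing algorithm.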

Keywords: aortic dilatation; deep learning; natural language processing; pneumothorax; pulmonary embolism; pulmonary emphysema; pulmonary nodule; rib fractures.