Assessment of a deep learning model for COVID-19 classification on chest radiographs: a comparison across image acquisition techniques and clinical factors

J Med Imaging (Bellingham). 2023 Nov;10(6):064504. doi: 10.1117/1.JMI.10.6.064504. Epub 2023 Dec 28.

Abstract

Purpose: The purpose is to assess the performance of a pre-trained deep learning model in the task of classifying between coronavirus disease (COVID)-positive and COVID-negative patients from chest radiographs (CXRs) while considering various image acquisition parameters, clinical factors, and patient demographics.

Methods: Standard and soft-tissue CXRs of 9860 patients comprised the "original dataset," consisting of training and test sets and were used to train a DenseNet-121 architecture model to classify COVID-19 using three classification algorithms: standard, soft tissue, and a combination of both types of images via feature fusion. A larger more-current test set of 5893 patients (the "current test set") was used to assess the performance of the pretrained model. The current test set contained a larger span of dates, incorporated different variants of the virus and included different immunization statuses. Model performance between the original and current test sets was evaluated using area under the receiver operating characteristic curve (ROC AUC) [95% CI].

Results: The model achieved AUC values of 0.67 [0.65, 0.70] for cropped standard images, 0.65 [0.63, 0.67] for cropped soft-tissue images, and 0.67 [0.65, 0.69] for both types of cropped images. These were all significantly lower than the performance of the model on the original test set. Investigations regarding matching the acquisition dates between the test sets (i.e., controlling for virus variants), immunization status, disease severity, and age and sex distributions did not fully explain the discrepancy in performance.

Conclusions: Several relevant factors were considered to determine whether differences existed in the test sets, including time period of image acquisition, vaccination status, and disease severity. The lower performance on the current test set may have occurred due to model overfitting and a lack of generalizability.

Keywords: COVID-19; chest radiography; deep learning; generalizability.