Validation and algorithmic audit of a deep learning system for the detection of proximal femoral fractures in patients in the emergency department: a diagnostic accuracy study

doi:10.1016/S2589-7500(22)00004-8

Lancet Digit Health. 2022 May;4(5):e351-e358. doi: 10.1016/S2589-7500(22)00004-8. Epub 2022 Apr 5.

Authors

Lauren Oakden-Rayner¹, William Gale², Thomas A Bonham³, Matthew P Lungren⁴, Gustavo Carneiro⁵, Andrew P Bradley⁶, Lyle J Palmer⁷

Affiliations

¹ School of Public Health, University of Adelaide, Adelaide, SA, Australia; Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia. Electronic address: lauren.oakden-rayner@adelaide.edu.au.
² Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia; School of Computer Science, University of Adelaide, Adelaide, SA, Australia.
³ Stanford University School of Medicine, Department of Radiology, Stanford, CA, USA.
⁴ Stanford University School of Medicine, Department of Radiology, Stanford, CA, USA; Stanford Artificial Intelligence in Medicine and Imaging Center, Stanford University, Stanford, CA, USA.
⁵ Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia.
⁶ Science and Engineering Faculty, Queensland University of Technology, Brisbane, QLD, Australia.
⁷ School of Public Health, University of Adelaide, Adelaide, SA, Australia; Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia.

PMID: 35396184

Abstract

Background: Proximal femoral fractures are an important clinical and public health issue associated with substantial morbidity and early mortality. Artificial intelligence might offer improved diagnostic accuracy for these fractures, but typical approaches to testing of artificial intelligence models can underestimate the risks of artificial intelligence-based diagnostic systems.

Methods: We present a preclinical evaluation of a deep learning model intended to detect proximal femoral fractures in frontal x-ray films in emergency department patients, trained on films from the Royal Adelaide Hospital (Adelaide, SA, Australia). This evaluation included a reader study comparing the performance of the model against five radiologists (three musculoskeletal specialists and two general radiologists) on a dataset of 200 fracture cases and 200 non-fractures (also from the Royal Adelaide Hospital), an external validation study using a dataset obtained from Stanford University Medical Center, CA, USA, and an algorithmic audit to detect any unusual or unexpected model behaviour.

Findings: In the reader study, the area under the receiver operating characteristic curve (AUC) for the performance of the deep learning model was 0·994 (95% CI 0·988-0·999) compared with an AUC of 0·969 (0·960-0·978) for the five radiologists. This strong model performance was maintained on external validation, with an AUC of 0·980 (0·931-1·000). However, the preclinical evaluation identified barriers to safe deployment, including a substantial shift in the model operating point on external validation and an increased error rate on cases with abnormal bones (eg, Paget's disease).
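For readers unfamiliar with how an AUC and its confidence interval can be estimated, the following is a minimal sketch. The abstract does not state how the reported intervals were computed, so the percentile bootstrap used here is an assumption, as are the function names and the resampling of fracture and non-fracture cases separately. The scores are synthetic and are not data from the study.

```python
import random

def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def bootstrap_auc_ci(scores_pos, scores_neg, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed method, not the paper's)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample each class with replacement, keeping class sizes fixed.
        pos = rng.choices(scores_pos, k=len(scores_pos))
        neg = rng.choices(scores_neg, k=len(scores_neg))
        stats.append(auc(pos, neg))
    stats.sort()
    low = stats[int((alpha / 2) * n_boot)]
    high = stats[int((1 - alpha / 2) * n_boot) - 1]
    return low, high

# Synthetic model scores for illustration only (NOT data from the study).
rng = random.Random(42)
fracture_scores = [rng.gauss(2.0, 1.0) for _ in range(50)]
normal_scores = [rng.gauss(0.0, 1.0) for _ in range(50)]

point_estimate = auc(fracture_scores, normal_scores)
ci_low, ci_high = bootstrap_auc_ci(fracture_scores, normal_scores)
print(f"AUC {point_estimate:.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")
```

An interval estimated this way tends to widen as the number of cases falls, which is one possible reason an external validation interval could be wider than an internal one.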

Interpretation: The model outperformed the radiologists tested and maintained performance on external validation, but showed several unexpected limitations during further testing. Thorough preclinical evaluation of artificial intelligence models, including algorithmic auditing, can reveal unexpected and potentially harmful behaviour even in high-performance artificial intelligence systems, which can inform future clinical testing and deployment decisions.
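The operating-point shift reported in the findings can coexist with a maintained AUC, because AUC depends only on how cases are ranked while sensitivity and specificity depend on a fixed decision threshold. The toy sketch below illustrates this under invented assumptions: the scores, the threshold of 0.5, and the direction and size of the shift are all hypothetical and are not taken from the study.

```python
def sensitivity_specificity(scores_pos, scores_neg, threshold):
    """Call a case positive when its score is at or above the threshold."""
    true_pos = sum(1 for s in scores_pos if s >= threshold)
    true_neg = sum(1 for s in scores_neg if s < threshold)
    return true_pos / len(scores_pos), true_neg / len(scores_neg)

# Hypothetical scores (NOT study data). In both sets every fracture case
# scores above every non-fracture case, so the ranking is perfect in each.
internal_pos = [0.9, 0.8, 0.7, 0.6]
internal_neg = [0.1, 0.2, 0.3, 0.4]
external_pos = [0.7, 0.6, 0.45, 0.4]   # same ranking quality, lower scores
external_neg = [0.05, 0.1, 0.2, 0.3]

threshold = 0.5  # hypothetical threshold chosen on the internal data

internal = sensitivity_specificity(internal_pos, internal_neg, threshold)
external = sensitivity_specificity(external_pos, external_neg, threshold)
print(internal)  # (1.0, 1.0)
print(external)  # (0.5, 1.0)
```

In this toy case the threshold that separated the internal data perfectly misses half of the external fractures, even though the external ranking is still perfect. This suggests why a threshold might need to be re-checked at each new site before deployment.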

Funding: None.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Artificial Intelligence
Deep Learning*
Emergency Service, Hospital
Femoral Fractures* / diagnostic imaging
Humans
Retrospective Studies