Gradient-Based Saliency Maps Are Not Trustworthy Visual Explanations of Automated AI Musculoskeletal Diagnoses

J Imaging Inform Med. 2024 May 6. doi: 10.1007/s10278-024-01136-4. Online ahead of print.

Abstract

Saliency maps are widely used to "explain" decisions made by modern machine learning models, including deep convolutional neural networks (DCNNs). While the resulting heatmaps purportedly indicate important image features, their "trustworthiness," i.e., utility and robustness, has not been evaluated for musculoskeletal imaging. The purpose of this study was to systematically evaluate the trustworthiness of saliency maps used in disease diagnosis on upper extremity X-ray images. The underlying DCNNs were trained using the Stanford MURA dataset. We studied four trustworthiness criteria, (1) localization accuracy of abnormalities, (2) repeatability, (3) reproducibility, and (4) sensitivity to underlying DCNN weights, across six gradient-based saliency methods (Grad-CAM (GCAM), gradient explanation (GRAD), integrated gradients (IG), SmoothGrad (SG), smooth IG (SIG), and XRAI). Ground truth was defined by the consensus of three fellowship-trained musculoskeletal radiologists who each placed bounding boxes around abnormalities on a holdout saliency test set. Compared to radiologists, all saliency methods showed inferior localization (AUPRCs ranging from 0.438 (SG) to 0.590 (XRAI); average radiologist AUPRC: 0.816), repeatability (IoUs from 0.427 (SG) to 0.551 (IG); average radiologist IoU: 0.613), and reproducibility (IoUs from 0.250 (SG) to 0.502 (XRAI); average radiologist IoU: 0.613) on abnormalities such as fractures, orthopedic hardware insertions, and arthritis. Five methods (GCAM, GRAD, IG, SG, XRAI) passed the sensitivity test. Ultimately, no saliency method met all four trustworthiness criteria; therefore, we recommend caution and rigorous evaluation of saliency maps prior to their clinical use.

Keywords: Deep learning; Musculoskeletal radiology; Saliency methods; Trustworthy AI.
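
To make the reported metrics concrete, below is a minimal sketch (not the authors' evaluation code) of how per-image IoU (used for repeatability and reproducibility) and AUPRC against a radiologist bounding-box mask (used for localization) might be computed with NumPy and scikit-learn. The top-fraction binarization rule, the 224x224 image size, and the toy inputs are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc


def binarize_top_fraction(saliency: np.ndarray, fraction: float = 0.1) -> np.ndarray:
    """Keep the top `fraction` most salient pixels (one common, assumed binarization rule)."""
    threshold = np.quantile(saliency, 1.0 - fraction)
    return saliency >= threshold


def iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union of two binary masks, as used for
    repeatability (retrained model) or reproducibility (different architecture)."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection) / float(union) if union > 0 else 0.0


def auprc(saliency: np.ndarray, gt_mask: np.ndarray) -> float:
    """Area under the precision-recall curve, treating each pixel as a sample
    and the radiologist bounding-box mask as the positive class (localization)."""
    precision, recall, _ = precision_recall_curve(
        gt_mask.ravel().astype(int), saliency.ravel()
    )
    return auc(recall, precision)


# Toy example: two saliency maps (e.g., from two independently trained DCNNs)
# and a hypothetical radiologist bounding box rendered as a pixel mask.
rng = np.random.default_rng(0)
map_run1 = rng.random((224, 224))
map_run2 = rng.random((224, 224))
gt = np.zeros((224, 224), dtype=bool)
gt[80:140, 60:130] = True  # bounding box region marked as positive pixels

print("repeatability IoU:", iou(binarize_top_fraction(map_run1),
                                binarize_top_fraction(map_run2)))
print("localization AUPRC:", auprc(map_run1, gt))
```

With random saliency maps, both numbers are near chance level; the study's finding is that even real gradient-based maps fall well short of the radiologist baselines (AUPRC 0.816, IoU 0.613) on these criteria.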