Between Always and Never: Evaluating Uncertainty in Radiology Reports Using Natural Language Processing

J Digit Imaging. 2020 Oct;33(5):1194-1201. doi: 10.1007/s10278-020-00379-1. Epub 2020 Aug 19.

Abstract

The ideal radiology report reduces diagnostic uncertainty while avoiding ambiguity whenever possible. The purpose of this study was to characterize the use of uncertainty terms in radiology reports at a single institution and compare the use of these terms across imaging modalities, anatomic sections, patient characteristics, and radiologist characteristics. We hypothesized that there would be variability among radiologists and between subspecialties within radiology regarding the use of uncertainty terms and that the length of the impression of a report would be a predictor of the use of uncertainty terms. Finally, we hypothesized that use of uncertainty terms would often be interpreted by human readers as "hedging." To test these hypotheses, we applied a natural language processing (NLP) algorithm to detect and count uncertainty terms within radiology reports. The algorithm was built to detect usage of a published set of uncertainty terms. All 642,569 radiology report impressions from 171 reporting radiologists were collected from 2011 through 2015. For validation, two radiologists without knowledge of the software algorithm reviewed report impressions and were asked to determine whether each report was "uncertain" or "hedging." The presence of one or more uncertainty terms was then compared against the human readers' assessments. There were significant differences in the proportion of reports containing uncertainty terms across patient admission status and across anatomic imaging subsections. Reports with uncertainty were significantly longer than those without, although report length was not significantly different between subspecialties or modalities. There were no significant differences in rates of uncertainty when comparing the experience of the attending radiologist. When compared with reader 1 as a gold standard, accuracy was 0.91, sensitivity was 0.92, specificity was 0.9, and precision was 0.88, with an F1-score of 0.9.
When compared with reader 2, accuracy was 0.84, sensitivity was 0.88, specificity was 0.82, and precision was 0.68, with an F1-score of 0.77. Substantial variability exists among radiologists and subspecialties regarding the use of uncertainty terms, and this variability cannot be explained by years of radiologist experience or by differences in the proportions of specific modalities. Furthermore, detection of uncertainty terms demonstrates good test characteristics for predicting human readers' assessment of uncertainty.
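The pipeline the abstract describes — flag a report impression as "uncertain" if it contains at least one term from a predefined list, then score those flags against a human reader's binary labels — can be sketched as below. This is a minimal illustration, not the authors' implementation: the term list here is a small hypothetical subset (the study used a published set not reproduced in the abstract), and the metric formulas are the standard definitions of accuracy, sensitivity, specificity, precision, and F1.

```python
import re

# Hypothetical subset of uncertainty terms; the study used a published
# term list that is not reproduced in this abstract.
UNCERTAINTY_TERMS = [
    "possible", "possibly", "probable", "may represent",
    "cannot exclude", "suspicious for", "likely",
]

# Case-insensitive whole-phrase matcher over the term list.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in UNCERTAINTY_TERMS) + r")\b",
    re.IGNORECASE,
)

def count_uncertainty_terms(impression: str) -> int:
    """Count occurrences of uncertainty terms in a report impression."""
    return len(_pattern.findall(impression))

def is_uncertain(impression: str) -> bool:
    """Flag a report as 'uncertain' if it contains >= 1 uncertainty term."""
    return count_uncertainty_terms(impression) >= 1

def evaluate(predictions, labels):
    """Score algorithm flags against a reader's binary 'uncertain' labels,
    treating the reader as the gold standard."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    tn = sum(not p and not l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    accuracy = (tp + tn) / len(labels)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return dict(accuracy=accuracy, sensitivity=sensitivity,
                specificity=specificity, precision=precision, f1=f1)
```

For example, scoring the flags `[True, False, True, False]` against reader labels `[True, False, True, True]` yields one false negative, giving accuracy 0.75, sensitivity 2/3, specificity 1.0, precision 1.0, and F1 0.8.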

Keywords: Diagnostic uncertainty; Natural language processing.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Humans
  • Natural Language Processing*
  • Radiology Information Systems*
  • Radiology*
  • Research Report
  • Uncertainty