Range of Radiologist Performance in a Population-based Screening Cohort of 1 Million Digital Mammography Examinations

Radiology. 2020 Oct;297(1):33-39. doi: 10.1148/radiol.2020192212. Epub 2020 Jul 28.

Abstract

Background: There is great interest in developing artificial intelligence (AI)-based computer-aided detection (CAD) systems for use in screening mammography. Comparative performance benchmarks from true screening cohorts are needed.

Purpose: To determine the range of human first-reader performance measures within a population-based screening cohort of 1 million screening mammograms to gauge the performance of emerging AI CAD systems.

Materials and Methods: This retrospective study consisted of all screening mammograms in women aged 40-74 years in Stockholm County, Sweden, who underwent screening with full-field digital mammography between 2008 and 2015. There were 110 interpreting radiologists, of whom 24 were defined as high-volume readers (ie, those who interpreted more than 5000 annual screening mammograms). A true-positive finding was defined as the presence of a pathology-confirmed cancer within 12 months. Performance benchmarks included sensitivity and specificity, examined per quartile of radiologists' performance. First-reader sensitivity was determined for each tumor subgroup, overall and by quartile of high-volume reader sensitivity. Screening outcomes were examined based on the first reader's sensitivity quartile with 10 000 screening mammograms per quartile. Linear regression models were fitted to test for a linear trend across quartiles of performance.

Results: A total of 418 041 women (mean age, 54 years ± 10 [standard deviation]) were included, and 1 186 045 digital mammograms were evaluated, with 972 899 assessed by high-volume readers. Overall sensitivity was 73% (95% confidence interval [CI]: 69%, 77%), and overall specificity was 96% (95% CI: 95%, 97%). The mean values per quartile of high-volume reader performance ranged from 63% to 84% for sensitivity and from 95% to 98% for specificity. The sensitivity difference was very large for basal cancers, with the least sensitive and most sensitive high-volume readers detecting 53% and 89% of cancers, respectively (P < .001).

Conclusion: Benchmarks showed a wide range of performance differences between high-volume readers. Sensitivity varied by tumor characteristics.

© RSNA, 2020. Online supplemental material is available for this article.
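
As an illustration of the benchmarks named in the abstract, the following minimal Python sketch computes per-reader sensitivity and specificity and then averages them within quartiles of readers. It is not the authors' analysis code: the function names, the rule for splitting readers into four groups, and all counts are hypothetical and chosen only to show the arithmetic.

```python
from statistics import mean


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of examinations with confirmed cancer that the reader flagged."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of cancer-free examinations that the reader did not flag."""
    return tn / (tn + fp)


def quartile_means(values: list[float]) -> list[float]:
    """Sort values, split them into four near-equal groups, and average each.

    The grouping rule here is an assumption; the abstract does not say how
    quartile boundaries were drawn.
    """
    if len(values) < 4:
        raise ValueError("need at least four values to form quartiles")
    ordered = sorted(values)
    n = len(ordered)
    bounds = [i * n // 4 for i in range(5)]
    return [mean(ordered[bounds[i]:bounds[i + 1]]) for i in range(4)]


# Made-up per-reader counts (tp, fn, tn, fp); not taken from the study.
readers = {
    "reader_1": (30, 20, 9400, 550),
    "reader_2": (33, 17, 9450, 500),
    "reader_3": (35, 15, 9500, 450),
    "reader_4": (36, 14, 9550, 400),
    "reader_5": (38, 12, 9600, 350),
    "reader_6": (40, 10, 9650, 300),
    "reader_7": (42, 8, 9700, 250),
    "reader_8": (44, 6, 9750, 200),
}

per_reader_sensitivity = [sensitivity(tp, fn) for tp, fn, _, _ in readers.values()]
per_reader_specificity = [specificity(tn, fp) for _, _, tn, fp in readers.values()]

print("sensitivity by quartile:", quartile_means(per_reader_sensitivity))
print("specificity by quartile:", quartile_means(per_reader_specificity))
```

Each measure is ranked on its own here, so a reader can fall in different quartiles for sensitivity and specificity. Whether the study ranked readers this way is not stated in the abstract.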

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Adult
  • Aged
  • Artificial Intelligence*
  • Benchmarking
  • Breast Neoplasms / diagnostic imaging*
  • Clinical Competence*
  • Early Detection of Cancer
  • Female
  • Humans
  • Mammography
  • Mass Screening
  • Middle Aged
  • Retrospective Studies
  • Sensitivity and Specificity
  • Sweden