Range of Radiologist Performance in a Population-based Screening Cohort of 1 Million Digital Mammography Examinations

Mattie Salim; Karin Dembrower; Martin Eklund; Peter Lindholm; Fredrik Strand

doi:10.1148/radiol.2020192212

Radiology. 2020 Oct;297(1):33-39. doi: 10.1148/radiol.2020192212. Epub 2020 Jul 28.

Authors

Mattie Salim¹, Karin Dembrower¹, Martin Eklund¹, Peter Lindholm¹, Fredrik Strand¹

Affiliation

¹ From the Departments of Pathology and Oncology (M.S., F.S.), Physiology and Pharmacology (K.D., P.L.), and Medical Epidemiology and Biostatistics (M.E.), Karolinska Institute, Stockholm, Sweden; Department of Radiology (M.S.) and Breast Radiology (F.S.), Karolinska University Hospital, Dalagatan 90, 113 43 Stockholm, Sweden; and the Department of Radiology, Capio Sankt Görans Hospital, Stockholm, Sweden (K.D.).

PMID: 32720866
DOI: 10.1148/radiol.2020192212

Abstract

Background: There is great interest in developing artificial intelligence (AI)-based computer-aided detection (CAD) systems for use in screening mammography. Comparative performance benchmarks from true screening cohorts are needed.

Purpose: To determine the range of human first-reader performance measures within a population-based screening cohort of 1 million screening mammograms to gauge the performance of emerging AI CAD systems.

Materials and Methods: This retrospective study consisted of all screening mammograms in women aged 40-74 years in Stockholm County, Sweden, who underwent screening with full-field digital mammography between 2008 and 2015. There were 110 interpreting radiologists, of whom 24 were defined as high-volume readers (ie, those who interpreted more than 5000 annual screening mammograms). A true-positive finding was defined as the presence of a pathology-confirmed cancer within 12 months. Performance benchmarks included sensitivity and specificity, examined per quartile of radiologists' performance. First-reader sensitivity was determined for each tumor subgroup, overall and by quartile of high-volume reader sensitivity. Screening outcomes were examined based on the first reader's sensitivity quartile with 10 000 screening mammograms per quartile. Linear regression models were fitted to test for a linear trend across quartiles of performance.

Results: A total of 418 041 women (mean age, 54 years ± 10 [standard deviation]) were included, and 1 186 045 digital mammograms were evaluated, with 972 899 assessed by high-volume readers. Overall sensitivity was 73% (95% confidence interval [CI]: 69%, 77%), and overall specificity was 96% (95% CI: 95%, 97%). The mean values per quartile of high-volume reader performance ranged from 63% to 84% for sensitivity and from 95% to 98% for specificity. The sensitivity difference was very large for basal cancers, with the least sensitive and most sensitive high-volume readers detecting 53% and 89% of cancers, respectively (P < .001).

Conclusion: Benchmarks showed a wide range of performance differences between high-volume readers. Sensitivity varied by tumor characteristics.

© RSNA, 2020. Online supplemental material is available for this article.
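The benchmark measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' analysis code. It assumes each reader's examinations have already been tallied into true positives, false negatives, true negatives, and false positives, and it groups readers into four near-equal quartiles by sorted rank, which may differ from how the study assigned quartiles. The names `ReaderCounts`, `sensitivity`, `specificity`, and `quartile_means` are hypothetical, and the counts are synthetic values chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ReaderCounts:
    """Hypothetical per-reader tally of screening outcomes."""
    tp: int  # recalled, cancer confirmed within the follow-up window
    fn: int  # not recalled, cancer confirmed within the follow-up window
    tn: int  # not recalled, no cancer
    fp: int  # recalled, no cancer


def sensitivity(c: ReaderCounts) -> float:
    """Share of cancers that the reader flagged: TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn)


def specificity(c: ReaderCounts) -> float:
    """Share of cancer-free examinations the reader cleared: TN / (TN + FP)."""
    return c.tn / (c.tn + c.fp)


def quartile_means(values: list[float]) -> list[float]:
    """Sort the values, split them into four contiguous groups of
    near-equal size, and return the mean of each group (lowest first)."""
    if len(values) < 4:
        raise ValueError("need at least four values to form quartiles")
    ordered = sorted(values)
    n = len(ordered)
    bounds = [round(i * n / 4) for i in range(5)]
    return [mean(ordered[bounds[i]:bounds[i + 1]]) for i in range(4)]


# Synthetic example: eight made-up readers, not data from the study.
readers = [
    ReaderCounts(tp=5, fn=5, tn=950, fp=50),
    ReaderCounts(tp=6, fn=4, tn=960, fp=40),
    ReaderCounts(tp=7, fn=3, tn=960, fp=40),
    ReaderCounts(tp=7, fn=3, tn=970, fp=30),
    ReaderCounts(tp=8, fn=2, tn=970, fp=30),
    ReaderCounts(tp=8, fn=2, tn=980, fp=20),
    ReaderCounts(tp=9, fn=1, tn=980, fp=20),
    ReaderCounts(tp=9, fn=1, tn=990, fp=10),
]

sens_by_quartile = quartile_means([sensitivity(r) for r in readers])
spec_by_quartile = quartile_means([specificity(r) for r in readers])

for q, (se, sp) in enumerate(zip(sens_by_quartile, spec_by_quartile), start=1):
    print(f"Quartile {q}: mean sensitivity {se:.2f}, mean specificity {sp:.3f}")
```

In this sketch the sensitivity and specificity quartiles are formed independently, so a reader may fall into different quartiles for the two measures.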

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Adult
Aged
Artificial Intelligence*
Benchmarking
Breast Neoplasms / diagnostic imaging*
Clinical Competence*
Early Detection of Cancer
Female
Humans
Mammography
Mass Screening
Middle Aged
Retrospective Studies
Sensitivity and Specificity
Sweden