An International Non-Inferiority Study for the Benchmarking of AI for Routine Radiology Cases: Chest X-ray, Fluorography and Mammography

Healthcare (Basel). 2023 Jun 8;11(12):1684. doi: 10.3390/healthcare11121684.

Abstract

An international reader study was conducted to gauge the average diagnostic accuracy of radiologists interpreting chest X-ray, fluorography and mammography images, and to establish requirements for stand-alone radiological artificial intelligence (AI) models. The retrospective studies in the datasets were labelled as containing or not containing target pathological findings based on the consensus of two experienced radiologists and, where applicable, the results of laboratory tests and follow-up examinations. A total of 204 radiologists from 11 countries, with varying levels of experience, assessed the dataset on a 5-point Likert scale via a web platform. Eight commercial radiological AI models analyzed the same dataset. The AUROC was 0.87 (95% CI 0.83-0.90) for AI versus 0.96 (95% CI 0.94-0.97) for radiologists. Sensitivity was 0.71 (95% CI 0.64-0.78) for AI versus 0.91 (95% CI 0.86-0.95) for radiologists, and specificity was 0.93 (95% CI 0.89-0.96) for AI versus 0.90 (95% CI 0.85-0.94) for radiologists. The overall diagnostic accuracy of radiologists was superior to that of AI for chest X-ray and mammography. However, the accuracy of AI was non-inferior to that of the least experienced radiologists for mammography and fluorography, and to that of all radiologists for chest X-ray. Therefore, AI-based first reading could be recommended to reduce the workload of radiologists for the most common radiological studies, such as chest X-ray and mammography.
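To make the reported metrics concrete, the following is a minimal illustrative sketch (not the study's actual code) of how sensitivity, specificity and AUROC are computed from binary pathology labels and reader or model scores; the toy labels and scores below are invented for demonstration only.

```python
# Illustrative sketch: sensitivity, specificity and AUROC for a binary
# "target pathological finding present / absent" task. The data here is
# a made-up toy example, not data from the study.

def sensitivity_specificity(labels, preds):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive case scores higher than a negative one
    (ties counted as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 4 cases with pathology (label 1), 4 without (label 0).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical threshold

sens, spec = sensitivity_specificity(labels, preds)
print(sens, spec, auroc(labels, scores))  # → 0.75 0.75 0.9375
```

In the study, such point estimates are reported with bootstrap-style 95% confidence intervals, and AI is judged non-inferior when its accuracy does not fall below the radiologists' by more than a pre-specified margin.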

Keywords: benchmarking; population screening; radiology; stand-alone artificial intelligence.