Assessment of bias in scoring of AI-based radiotherapy segmentation and planning studies using modified TRIPOD and PROBAST guidelines as an example

Radiother Oncol. 2024 May;194:110196. doi: 10.1016/j.radonc.2024.110196. Epub 2024 Mar 2.

Abstract

Background and purpose: Studies investigating the application of Artificial Intelligence (AI) in radiotherapy vary substantially in quality. The goal of this study was to assess the degree of transparency and bias when scoring articles, with a specific focus on AI-based segmentation and treatment planning, using modified PROBAST and TRIPOD checklists, in order to provide recommendations for future guideline developers and reviewers.

Materials and methods: The TRIPOD and PROBAST checklist items were discussed and modified using a Delphi process. After consensus was reached, 2 groups of 3 co-authors scored 2 articles to evaluate usability and further optimize the adapted checklists. Finally, 10 articles were scored by all co-authors. Fleiss' kappa was calculated to assess inter-observer agreement.
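For readers unfamiliar with the statistic, the sketch below illustrates how Fleiss' kappa can be computed for a single checklist item scored by multiple observers. It is not the authors' code; the rating categories and the example data are purely illustrative assumptions, and the statsmodels routines are one possible implementation choice.

```python
# Illustrative sketch only: chance-corrected multi-rater agreement (Fleiss' kappa)
# for one checklist item. Categories and data below are assumed, not from the study.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = articles scored, columns = observers; entries are category codes
# (e.g., 0 = "no", 1 = "yes", 2 = "not applicable") for a single checklist item.
ratings = np.array([
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [2, 1, 2, 2, 1, 2],
    [0, 1, 0, 0, 1, 0],
])

# Convert per-observer ratings into an (articles x categories) count table,
# then compute Fleiss' kappa.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method='fleiss')
print(f"Fleiss' kappa for this item: {kappa:.2f}")
```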

Results: 3 of the 37 TRIPOD items and 5 of the 32 PROBAST items were deemed irrelevant. General terminology in the items (e.g., multivariable prediction model, predictors) was modified to align with AI-specific terms. After the first scoring round, further improvements to the items were formulated, e.g., avoiding sub-questions and subjective wording, and adding clarifications on how to score an item. Using the final consensus list to score the 10 articles, only 2 of the 61 items resulted in a statistically significant kappa of 0.4 or more, demonstrating substantial agreement. For 41 items, no statistically significant kappa was obtained, indicating that the observed agreement among observers could be attributed to chance alone.

Conclusion: Our study showed low inter-observer reliability with the adapted TRIPOD and PROBAST checklists. Although such checklists have shown great value during model development and reporting, this raises concerns about their applicability for objectively scoring scientific articles on AI applications. When developing or revising guidelines, it is essential to consider whether they can be used to score articles without introducing bias.

Keywords: Artificial intelligence; Bias; Checklists; Distinctiveness; Guidelines; Inter-observer variation; Oncology; Radiation therapy; Transparency.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Artificial Intelligence*
  • Bias
  • Checklist*
  • Delphi Technique*
  • Humans
  • Neoplasms / radiotherapy
  • Practice Guidelines as Topic
  • Radiotherapy Planning, Computer-Assisted* / methods
  • Radiotherapy Planning, Computer-Assisted* / standards
  • Reproducibility of Results