Towards a guideline for evaluation metrics in medical image segmentation

Dominik Müller; Iñaki Soto-Rey; Frank Kramer

doi:10.1186/s13104-022-06096-y

BMC Res Notes. 2022 Jun 20;15(1):210. doi: 10.1186/s13104-022-06096-y.

Authors

Dominik Müller¹ ², Iñaki Soto-Rey³, Frank Kramer⁴

Affiliations

¹ IT-Infrastructure for Translational Medical Research, University of Augsburg, Augsburg, Germany. dominik.mueller@informatik.uni-augsburg.de.
² Medical Data Integration Center, Institute for Digital Medicine, University Hospital Augsburg, Augsburg, Germany. dominik.mueller@informatik.uni-augsburg.de.
³ Medical Data Integration Center, Institute for Digital Medicine, University Hospital Augsburg, Augsburg, Germany.
⁴ IT-Infrastructure for Translational Medical Research, University of Augsburg, Augsburg, Germany.

Abstract

In the last decade, research on artificial intelligence has grown rapidly, driven by deep learning models, especially in the field of medical image segmentation. Various studies have demonstrated that these models have powerful prediction capabilities and achieve results comparable to those of clinicians. However, recent studies have revealed that evaluation in image segmentation studies lacks reliable model performance assessment and shows statistical bias caused by incorrect metric implementation or usage. Thus, this work provides an overview and interpretation guide for the following metrics for medical image segmentation evaluation in binary as well as multi-class problems: Dice similarity coefficient, Jaccard index, Sensitivity, Specificity, Rand index, ROC curves, Cohen's Kappa, and Hausdorff distance. Furthermore, common issues such as class imbalance and statistical as well as interpretation biases in evaluation are discussed. In summary, we propose a guideline for standardized medical image segmentation evaluation to improve evaluation quality, reproducibility, and comparability in the research field.

Keywords: Biomedical image segmentation; Semantic segmentation; Medical Image Analysis; Evaluation; Guideline; Performance assessment; Reproducibility.
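
As an illustration of the confusion-matrix-based metrics named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes binary masks flattened to lists of 0/1 values (1 = foreground). The function names `confusion_counts`, `safe_div`, and `overlap_metrics` are hypothetical, and the toy masks are made up for demonstration. Pixel accuracy is included only for contrast, to show how class imbalance can inflate a score that counts true negatives.

```python
def confusion_counts(truth, pred):
    """Count TP, FP, TN, FN for two equal-length binary sequences (1 = foreground)."""
    if len(truth) != len(pred):
        raise ValueError("masks must have the same number of pixels")
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def safe_div(num, den):
    """Return num / den, or NaN when the denominator is zero (metric undefined)."""
    return num / den if den else float("nan")


def overlap_metrics(truth, pred):
    """Compute standard confusion-matrix metrics for one binary segmentation."""
    tp, fp, tn, fn = confusion_counts(truth, pred)
    return {
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "jaccard": safe_div(tp, tp + fp + fn),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
    }


# Hypothetical toy case: 100 pixels, only 4 of them foreground (strong imbalance).
# The prediction hits 1 foreground pixel, misses 3, and adds 1 false positive.
truth = [1] * 4 + [0] * 96
pred = [1] + [0] * 3 + [1] + [0] * 95

metrics = overlap_metrics(truth, pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case, specificity and pixel accuracy are both above 0.95 because background dominates, while Dice (about 0.33) and Jaccard (0.20) reflect the poor overlap on the small region of interest.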

Publication types

Letter
Review

MeSH terms

Algorithms*
Artificial Intelligence*
Benchmarking
Image Processing, Computer-Assisted / methods
ROC Curve
Reproducibility of Results

Grants and funding

FKZ01ZZ1804E/Bundesministerium für Bildung und Forschung