Self-supervised learning for interventional image analytics: toward robust device trackers

Saahil Islam; Venkatesh N Murthy; Dominik Neumann; Badhan Kumar Das; Puneet Sharma; Andreas Maier; Dorin Comaniciu; Florin C Ghesu

doi:10.1117/1.JMI.11.3.035001

J Med Imaging (Bellingham). 2024 May;11(3):035001. doi: 10.1117/1.JMI.11.3.035001. Epub 2024 May 15.

Authors

Saahil Islam¹ ², Venkatesh N Murthy³, Dominik Neumann², Badhan Kumar Das¹ ², Puneet Sharma³, Andreas Maier¹, Dorin Comaniciu³, Florin C Ghesu²

Affiliations

¹ Friedrich-Alexander-Universität Erlangen-Nürnberg, Pattern Recognition Lab, Erlangen, Germany.
² Siemens Healthineers, Digital Technology and Innovation, Erlangen, Germany.
³ Siemens Healthineers, Digital Technology and Innovation, Princeton, New Jersey, United States.

PMID: 38756438
PMCID: PMC11094643 (available on 2025-05-15)
DOI: 10.1117/1.JMI.11.3.035001

Abstract

Purpose: Accurate detection and tracking of devices, such as guiding catheters, in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is used for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, tracking must be highly robust, ideally without failures. Achieving this requires efficiently handling several challenges: obscuration of the device by the contrast agent or by other external devices or wires, changes in the field of view or acquisition angle, and continuous movement due to cardiac and respiratory motion.

Approach: To overcome these challenges, we propose an approach that learns spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream in a lightweight model.
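As an illustration only, the masking step of a masked image modeling setup on a short frame sequence might be sketched as follows. The abstract does not specify patch size, mask ratios, or sequence length; the function name `mask_sequence`, the three-frame window, the 16-pixel patches, and the ratios (0.5, 0.9, 0.5) are all assumptions chosen for the sketch. The middle frame is hidden more heavily so that a model would have to reconstruct it largely from its neighbors, which is one plausible reading of "frame interpolation-based reconstruction".

```python
import numpy as np


def mask_sequence(frames, patch=16, ratios=(0.5, 0.9, 0.5), seed=0):
    """Randomly hide square patches in each frame of a short sequence.

    frames: array of shape (T, H, W); H and W must be multiples of `patch`.
    ratios: fraction of patches hidden per frame (hypothetical values).
    Returns (masked_frames, masks); masks has shape (T, H//patch, W//patch)
    and is True where a patch is hidden.
    """
    rng = np.random.default_rng(seed)
    t, h, w = frames.shape
    if len(ratios) != t or h % patch or w % patch:
        raise ValueError("ratios must match T; H and W must divide by patch")

    gh, gw = h // patch, w // patch
    masks = np.zeros((t, gh, gw), dtype=bool)
    for i, ratio in enumerate(ratios):
        n_hide = int(round(ratio * gh * gw))
        hidden = rng.choice(gh * gw, size=n_hide, replace=False)
        flat = np.zeros(gh * gw, dtype=bool)
        flat[hidden] = True
        masks[i] = flat.reshape(gh, gw)

    # Expand the patch-level mask to pixel resolution and zero hidden pixels.
    pixel_mask = np.repeat(np.repeat(masks, patch, axis=1), patch, axis=2)
    masked_frames = np.where(pixel_mask, 0.0, frames)
    return masked_frames, masks


# Toy input: three 64x64 frames of ones (not real X-ray data).
frames = np.ones((3, 64, 64))
masked_frames, masks = mask_sequence(frames)
```

In a full pipeline, an encoder would see only the visible patches and a decoder would be trained to reconstruct the hidden ones; those components are outside the scope of this sketch.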

Results: Our approach achieves state-of-the-art performance, in particular for robustness, compared to highly optimized reference solutions (that use multi-stage feature fusion or multi-task and flow regularization). The experiments show that our method achieves a 66.31% reduction in the maximum tracking error against the reference solutions (23.20% when flow regularization is used), achieving a success score of 97.95% at a 3× faster inference speed of 42 frames per second (on GPU). In addition, we achieve a 20% reduction in the standard deviation of errors, which indicates a much more stable tracking performance.
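For readers unfamiliar with how such figures are typically derived, a relative reduction is usually computed as (reference − ours) / reference. The short sketch below shows that arithmetic. The abstract reports only the percentages, so the absolute error values used here are hypothetical placeholders, and the helper name `relative_reduction` is introduced purely for illustration.

```python
def relative_reduction(reference, ours):
    """Percentage by which `ours` is lower than `reference`."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return 100.0 * (reference - ours) / reference


# Hypothetical maximum tracking errors (arbitrary units), chosen only so
# that the result matches the 66.31% figure quoted in the abstract.
reference_max_error = 100.0
our_max_error = 33.69
reduction = relative_reduction(reference_max_error, our_max_error)

# If "3x faster" is read as three times the reference throughput,
# 42 frames per second implies a reference speed of about 14.
implied_reference_fps = 42 / 3
```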

Conclusions: The proposed data-driven approach achieves superior performance, particularly in robustness and speed, compared with the frequently used multi-modular approaches for device tracking. The results encourage the use of our approach in other tasks within interventional image analytics that require an effective understanding of spatio-temporal semantics.

Keywords: device tracking; interventional imaging; self-supervised learning.