Detection, segmentation, and 3D pose estimation of surgical tools using convolutional neural networks and algebraic geometry

Med Image Anal. 2021 May:70:101994. doi: 10.1016/j.media.2021.101994. Epub 2021 Feb 7.

Abstract

Background and objective: Surgical tool detection, segmentation, and 3D pose estimation are crucial components in Computer-Assisted Laparoscopy (CAL). The existing frameworks have two main limitations. First, they do not integrate all three components. Integration is critical; for instance, one should not attempt computing pose if detection is negative. Second, they have highly specific requirements, such as the availability of a CAD model. We propose an integrated and generic framework whose sole requirement for the 3D pose is that the tool shaft is cylindrical. Our framework makes the most of deep learning and geometric 3D vision by combining a proposed Convolutional Neural Network (CNN) with algebraic geometry. We show two applications of our framework in CAL: tool-aware rendering in Augmented Reality (AR) and tool-based 3D measurement.

Methods: We name our CNN as ART-Net (Augmented Reality Tool Network). It has a Single Input Multiple Output (SIMO) architecture with one encoder and multiple decoders to achieve detection, segmentation, and geometric primitive extraction. These primitives are the tool edge-lines, mid-line, and tip. They allow the tool's 3D pose to be estimated by a fast algebraic procedure. The framework only proceeds if a tool is detected. The accuracy of segmentation and geometric primitive extraction is boosted by a new Full resolution feature map Generator (FrG). We extensively evaluate the proposed framework with the EndoVis and new proposed datasets. We compare the segmentation results against several variants of the Fully Convolutional Network (FCN) and U-Net. Several ablation studies are provided for detection, segmentation, and geometric primitive extraction. The proposed datasets are surgery videos of different patients.

Results: In detection, ART-Net achieves 100.0% in both average precision and accuracy. In segmentation, it achieves 81.0% in mean Intersection over Union (mIoU) on the robotic EndoVis dataset (articulated tool), where it outperforms both FCN and U-Net, by 4.5pp and 2.9pp, respectively. It achieves 88.2% in mIoU on the remaining datasets (non-articulated tool). In geometric primitive extraction, ART-Net achieves 2.45 and 2.23 in mean Arc Length (mAL) error for the edge-lines and mid-line, respectively, and 9.3 pixels in mean Euclidean distance error for the tool-tip. Finally, in terms of 3D pose evaluated on animal data, our framework achieves 1.87 mm, 0.70 mm, and 4.80 mm mean absolute errors on the X,Y, and Z coordinates, respectively, and 5.94 angular error on the shaft orientation. It achieves 2.59 mm and 1.99 mm in mean and median location error of the tool head evaluated on patient data.

Conclusions: The proposed framework outperforms existing ones in detection and segmentation. Compared to separate networks, integrating the tasks in a single network preserves accuracy in detection and segmentation but substantially improves accuracy in geometric primitive extraction. Overall, our framework has similar or better accuracy in 3D pose estimation while largely improving robustness against the very challenging imaging conditions of laparoscopy. The source code of our framework and our annotated dataset will be made publicly available at https://github.com/kamruleee51/ART-Net.

Keywords: 3D pose; Algebraic geometry; Augmented reality; Computer-assisted laparoscopy; Deep learning; Segmentation.

MeSH terms

  • Humans
  • Image Processing, Computer-Assisted*
  • Neural Networks, Computer*