Unsupervised feature disentanglement for video retrieval in minimally invasive surgery

Med Image Anal. 2022 Jan:75:102296. doi: 10.1016/j.media.2021.102296. Epub 2021 Nov 3.

Abstract

In this paper, we propose a novel method of Unsupervised Disentanglement of Scene and Motion (UDSM) representations for minimally invasive surgery video retrieval within large databases, which has the potential to advance intelligent and efficient surgical teaching systems. To extract more discriminative video representations, two encoders, trained with a triplet ranking loss and an adversarial learning mechanism, respectively capture the spatial and temporal information of each frame, producing disentangled features with promising interpretability. In addition, long-range temporal dependencies are modeled at the video level through a temporal aggregation module, and a set of compact binary codes carrying representative features is then produced to enable fast retrieval. The entire framework is trained in an unsupervised scheme, i.e., purely learning from raw surgical videos without using any annotation. We construct two large-scale minimally invasive surgery video datasets, based on the public dataset Cholec80 and our in-house dataset of laparoscopic hysterectomy, to establish the learning process and validate the effectiveness of our proposed method qualitatively and quantitatively on the surgical video retrieval task. Extensive experiments show that our approach significantly outperforms state-of-the-art video retrieval methods on both datasets, revealing a promising future for injecting intelligence into the next generation of surgical teaching systems.
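The abstract's retrieval pipeline rests on two generic building blocks: a triplet ranking loss that shapes the feature space, and compact binary codes ranked by Hamming distance at query time. The sketch below illustrates only these two standard components with NumPy; the function names, margin value, and sign-based binarization are illustrative assumptions, not the paper's actual learned encoders or training procedure.

```python
import numpy as np

def triplet_ranking_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet ranking loss: pull the anchor toward the positive
    sample and push it away from the negative by at least `margin`.
    (Margin value is an illustrative assumption.)"""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.maximum(0.0, d_pos - d_neg + margin).mean())

def binarize(features):
    """Sign-based binarization of real-valued features into +/-1 codes,
    a common way to obtain compact binary codes for hashing-based retrieval."""
    return np.where(features >= 0, 1, -1).astype(np.int8)

def hamming_retrieve(query_code, database_codes, top_k=5):
    """Rank database items by Hamming distance to the query code.
    For +/-1 codes of length L, Hamming distance = (L - dot product) / 2."""
    L = query_code.shape[-1]
    dists = (L - database_codes @ query_code.astype(np.int32)) // 2
    return np.argsort(dists)[:top_k]

# Toy usage: a database of three 8-bit codes, queried with a near-duplicate.
db = binarize(np.array([[ 0.9, -0.2,  0.4, -0.7,  0.1, -0.3,  0.8, -0.5],
                        [-0.6,  0.3, -0.9,  0.2, -0.4,  0.7, -0.1,  0.5],
                        [ 0.2,  0.8, -0.3,  0.6, -0.9, -0.2,  0.4,  0.1]]))
query = binarize(np.array([ 0.8, -0.1,  0.5, -0.6,  0.2, -0.4,  0.9, -0.3]))
ranking = hamming_retrieve(query, db, top_k=3)  # item 0 ranks first
```

Because Hamming distance on binary codes reduces to bit operations (or a single integer dot product, as above), retrieval over a large surgical video database scales far better than comparing dense real-valued features, which is the motivation for hashing-based schemes like the one the paper proposes.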

Keywords: Disentangled representation; Learning-based hashing; Surgical video analysis; Unsupervised video retrieval.

Publication types

  • Research Support, Non-U.S. Gov't
  • Video-Audio Media

MeSH terms

  • Databases, Factual
  • Humans
  • Minimally Invasive Surgical Procedures*
  • Motion