Contrastive pre-training and linear interaction attention-based transformer for universal medical reports generation

J Biomed Inform. 2023 Feb:138:104281. doi: 10.1016/j.jbi.2023.104281. Epub 2023 Jan 10.

Abstract

Interpreting medical images such as chest X-ray and retina images is an essential step in diagnosing and treating the relevant diseases. Automatic, reliable medical report generation systems can reduce this time-consuming workload, improve the efficiency of clinical workflows, and decrease practice variation between clinical professionals. Many recent approaches based on an image-encoder and language-decoder structure have been proposed for this task. However, several technical challenges remain, including the efficacy of fusing language and visual cues and the difficulty of obtaining an effective pre-trained image feature extractor for medical-specific tasks. In this work, we propose a weighted query-key interacting attention module that includes both second-order and first-order interactions. Compared with conventional scaled dot-product attention, this design provides a stronger fusion mechanism between language and visual signals. In addition, we propose a contrastive pre-training step to reduce the domain gap between the image encoder and the target dataset. To test the generalizability of our learning scheme, we collected the first multi-modality retina report generation dataset, referred to as Retina ImBank, and another large-scale Chinese retina report dataset, referred to as Retina Chinese, and verified our model on both. These two datasets will be made publicly available and will serve as benchmarks to encourage further research in this field. Experimental results demonstrate that our proposed method outperforms multiple state-of-the-art image captioning and medical report generation methods on the IU X-RAY, MIMIC-CXR, Retina ImBank, and Retina Chinese datasets.
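The abstract describes an attention module that augments the usual second-order (dot-product) query-key interaction with a first-order (linear) term and combines them with learnable weights. The paper's exact formulation is not given here, so the following is a minimal sketch of one plausible reading, assuming the second-order term is standard scaled dot-product attention, the first-order term is an additive pair of linear scores over queries and keys, and the mixing weights (`alpha`) are learned scalars; the class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightedQueryKeyInteractionAttention(nn.Module):
    """Hypothetical sketch of a weighted query-key interacting attention:
    mixes a second-order (scaled dot-product) score with a first-order
    (additive linear) score using learnable weights. Not the paper's
    verified implementation."""

    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # First-order terms: one scalar score per query / per key position.
        self.q_lin = nn.Linear(d_model, 1)
        self.k_lin = nn.Linear(d_model, 1)
        # Learnable mixing weights for the second- and first-order scores.
        self.alpha = nn.Parameter(torch.ones(2))

    def forward(self, query: torch.Tensor, key: torch.Tensor,
                value: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(query), self.k_proj(key), self.v_proj(value)
        # Second-order interaction: scaled dot product, shape (B, Lq, Lk).
        second = q @ k.transpose(-2, -1) / self.d_model ** 0.5
        # First-order interaction: additive linear scores broadcast
        # from (B, Lq, 1) and (B, 1, Lk) up to (B, Lq, Lk).
        first = self.q_lin(q) + self.k_lin(k).transpose(-2, -1)
        scores = self.alpha[0] * second + self.alpha[1] * first
        attn = F.softmax(scores, dim=-1)
        return attn @ v
```

In a report-generation decoder, `query` would come from the language side and `key`/`value` from the visual features, so the first-order term lets each modality contribute a per-position bias independent of the pairwise dot product.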

Keywords: Medical report generation; Vision and language.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Benchmarking*
  • Language*
  • Learning
  • Medical Records
  • Records