BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis

Masoud Monajatipoor; Mozhdeh Rouhsedaghat; Liunian Harold Li; C-C Jay Kuo; Aichi Chien; Kai-Wei Chang

doi:10.1007/978-3-031-16443-9_69

Med Image Comput Comput Assist Interv. 2022 Sep;13435:725-734. doi: 10.1007/978-3-031-16443-9_69. Epub 2022 Sep 16.

Authors

Masoud Monajatipoor¹, Mozhdeh Rouhsedaghat², Liunian Harold Li¹, C-C Jay Kuo², Aichi Chien³, Kai-Wei Chang¹

Affiliations

¹ Department of Computer Science, Samueli School of Engineering, University of California, Los Angeles, CA, 90095, USA.
² Department of Electrical Engineering, University of Southern California, Los Angeles, CA, 90007, USA.
³ Department of Radiological Sciences, David Geffen School of Medicine at UCLA, University of California, Los Angeles, CA, 90095, USA.

Abstract

Vision-and-language (V&L) models take images and text as input and learn to capture the associations between them. Such models could potentially handle tasks that involve understanding medical images together with their associated text. However, applying V&L models in the medical domain is challenging because data annotation is expensive and domain knowledge is required. In this paper, we find that the visual representation used in general V&L models is not suitable for processing medical data. To overcome this limitation, we propose BERTHop, a transformer-based model built on PixelHop++ and VisualBERT that better captures the associations between clinical notes and medical images. Experiments on the OpenI dataset, a commonly used thoracic disease diagnosis benchmark, show that BERTHop achieves an average Area Under the Curve (AUC) of 98.12%, which is 1.62% higher than the state of the art, while being trained on a 9× smaller dataset.
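
The average AUC reported above is a mean taken over per-disease scores. As a rough illustration only, the sketch below computes a per-label AUC with the pairwise (Mann-Whitney) definition and then takes the unweighted mean across labels. The function names, the toy labels, and the toy scores are all hypothetical, and the abstract does not say which averaging scheme the authors used, so this should be read as one plausible way to form such an average rather than as the paper's evaluation code.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def mean_auc(label_matrix, score_matrix):
    """Unweighted mean of per-label AUCs.

    Rows are samples and columns are labels (for example, one column per disease).
    Returns the mean together with the list of per-label AUCs.
    """
    n_labels = len(label_matrix[0])
    per_label = []
    for j in range(n_labels):
        column_labels = [row[j] for row in label_matrix]
        column_scores = [row[j] for row in score_matrix]
        per_label.append(auc(column_labels, column_scores))
    return sum(per_label) / n_labels, per_label


# Toy data (hypothetical): 4 samples, 2 labels.
toy_labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.7, 0.1]]

average, per_label = mean_auc(toy_labels, toy_scores)
print(per_label)  # [0.75, 1.0]
print(average)    # 0.875
```

The pairwise form is quadratic in the number of samples, which is fine for a small illustration; a rank-based implementation would be the usual choice for larger test sets.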

Keywords: Computer-aided diagnosis; Transfer learning; Vision & language model.

Grants and funding

R01 HL152270/HL/NHLBI NIH HHS/United States