Multimodal Representation Learning via Maximization of Local Mutual Information

Ruizhi Liao; Daniel Moyer; Miriam Cha; Keegan Quigley; Seth Berkowitz; Steven Horng; Polina Golland; William M Wells

doi:10.1007/978-3-030-87196-3_26

Med Image Comput Comput Assist Interv. 2021 Sep-Oct:12902:273-283. doi: 10.1007/978-3-030-87196-3_26. Epub 2021 Sep 21.

Authors

Ruizhi Liao¹, Daniel Moyer¹, Miriam Cha², Keegan Quigley², Seth Berkowitz³, Steven Horng³, Polina Golland¹, William M Wells¹,⁴

Affiliations

¹ CSAIL, Massachusetts Institute of Technology, Cambridge, MA, USA.
² MIT Lincoln Laboratory, Lexington, MA, USA.
³ Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA.
⁴ Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.

Abstract

We propose and demonstrate a representation learning approach that maximizes the mutual information between local features of images and text. The goal is to learn useful image representations by exploiting the rich information in the free text that describes the findings in the image. Our method trains image and text encoders by encouraging the resulting representations to exhibit high local mutual information, using recent advances in mutual information estimation with neural network discriminators. We argue that the sum of local mutual information is typically a lower bound on the global mutual information. Experimental results on downstream image classification tasks demonstrate the advantages of using local features for image-text representation learning. Our code is available at: https://github.com/RayRuizhiLiao/mutual_info_img_txt.
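
The idea of scoring local image-text feature pairs with a mutual information estimator can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that). It assumes an InfoNCE-style lower bound, uses a plain dot product in place of a trained neural network discriminator, and averages the bound over all pairs of image regions and text sentences. The function names `infonce_bound` and `local_mi_objective`, and the layout of the feature lists, are hypothetical choices made for this example.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors (stand-in for a learned critic)."""
    return sum(a * b for a, b in zip(u, v))


def infonce_bound(img_feats, txt_feats):
    """InfoNCE-style lower bound on mutual information, in nats.

    img_feats[i] and txt_feats[i] are feature vectors from the same
    image-report pair; all other combinations act as negative samples.
    """
    n = len(img_feats)
    total = 0.0
    for i in range(n):
        scores = [dot(img_feats[i], txt_feats[j]) for j in range(n)]
        # Numerically stable log-sum-exp over the candidate texts.
        m = max(scores)
        log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
        total += scores[i] - log_sum
    return math.log(n) + total / n


def local_mi_objective(local_img_feats, local_txt_feats):
    """Average the bound over every (image region, text sentence) pair.

    local_img_feats[i][r]: feature vector of region r in image i.
    local_txt_feats[i][s]: feature vector of sentence s in report i.
    Assumption: every image has the same number of regions and every
    report the same number of sentences.
    """
    n_regions = len(local_img_feats[0])
    n_sentences = len(local_txt_feats[0])
    bounds = []
    for r in range(n_regions):
        region = [img[r] for img in local_img_feats]
        for s in range(n_sentences):
            sentence = [txt[s] for txt in local_txt_feats]
            bounds.append(infonce_bound(region, sentence))
    return sum(bounds) / len(bounds)
```

In a training loop, the encoders and the critic would be updated to increase this quantity. With N pairs in a batch, each InfoNCE term is at most log N, so the estimate saturates for strongly dependent features; that limitation belongs to this particular estimator, not to the approach described in the abstract.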

Keywords: Local feature representations; Multimodal representation learning; Mutual information maximization.

Grants and funding

P41 EB028741/EB/NIBIB NIH HHS/United States