Transformers bridge vision and language to estimate and understand scene meaning

Res Sq [Preprint]. 2023 May 29:rs.3.rs-2968381. doi: 10.21203/rs.3.rs-2968381/v1.

Abstract

Humans rapidly process and understand real-world scenes with ease. Our stored semantic knowledge gained from experience is thought to be central to this ability by organizing perceptual information into meaningful units to efficiently guide our attention in scenes. However, the role stored semantic representations play in scene guidance remains difficult to study and poorly understood. Here, we apply a state-of-the-art multimodal transformer trained on billions of image-text pairs to help advance our understanding of the role semantic representations play in scene understanding. We demonstrate across multiple studies that this transformer-based approach can be used to automatically estimate local scene meaning in indoor and outdoor scenes, predict where people look in these scenes, detect changes in local semantic content, and provide a human-interpretable account of why one scene region is more meaningful than another. Taken together, these findings highlight how multimodal transformers can advance our understanding of the role scene semantics play in scene understanding by serving as a representational framework that bridges vision and language.
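A minimal sketch of the general idea described above, not the authors' published pipeline: a CLIP-style multimodal transformer (assumed here to be the kind of model meant by "trained on billions of image-text pairs") can score local scene regions against a text prompt, yielding a patch-level "meaning map" that could be compared with human fixation data. The model checkpoint, patch size, stride, and prompt below are illustrative assumptions.

```python
# Sketch: patch-level scene-meaning estimation with a CLIP-style model.
# Assumptions (not from the paper): openai/clip-vit-base-patch32 checkpoint,
# 64-px patches, 32-px stride, and the example text prompt.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def meaning_map(image: Image.Image,
                prompt: str = "a meaningful, informative scene region",
                patch: int = 64, stride: int = 32) -> torch.Tensor:
    """Score overlapping patches against a text prompt; higher image-text
    similarity is treated as a proxy for greater local semantic content."""
    w, h = image.size
    boxes = [(x, y, x + patch, y + patch)
             for y in range(0, h - patch + 1, stride)
             for x in range(0, w - patch + 1, stride)]
    crops = [image.crop(b) for b in boxes]
    inputs = processor(text=[prompt], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_patches, 1): similarity of each patch
    # to the single prompt.
    scores = out.logits_per_image.squeeze(-1)
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    return scores.reshape(rows, cols)

# Example usage (hypothetical image file):
# grid = meaning_map(Image.open("kitchen.jpg").convert("RGB"))
# The resulting grid can then be smoothed and correlated with fixation
# density maps to test how well it predicts where people look.
```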

Keywords: deep learning; scene perception; semantics; transformer.

Publication types

  • Preprint