Grounding spatial relations in text-only language models

Neural Netw. 2024 Feb;170:215-226. doi: 10.1016/j.neunet.2023.11.031. Epub 2023 Nov 17.

Abstract

This paper shows that text-only Language Models (LMs) can learn to ground spatial relations such as left of or below if they are provided with explicit location information about objects and are properly trained to leverage those locations. We perform experiments on a verbalized version of the Visual Spatial Reasoning (VSR) dataset, in which images are coupled with textual statements containing real or fake spatial relations between two objects in the image. We verbalize the images using an off-the-shelf object detector, adding location tokens to every object label to represent its bounding box in textual form. Given the small size of VSR, we do not observe any improvement when using locations, but pretraining the LM on a synthetic dataset that we derive automatically significantly improves results when location tokens are used. We thus show that locations allow LMs to ground spatial relations, with our text-only LMs outperforming Vision-and-Language Models and setting a new state of the art on the VSR dataset. Our analysis shows that our text-only LMs can generalize beyond the relations seen in the synthetic dataset to some extent, and that they learn more useful information than that encoded in the spatial rules we used to create the synthetic dataset.
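The verbalization step described above (object labels from a detector plus location tokens encoding their bounding boxes) can be illustrated with a minimal sketch. The token format, the number of coordinate bins, and the detection structure below are assumptions for illustration only; the paper's exact scheme may differ.

```python
# Minimal sketch: turn detector output into text with location tokens.
# Token format (<locN>), 100 coordinate bins, and the detection dicts
# are illustrative assumptions, not the paper's exact verbalization scheme.

from typing import List, Tuple

def box_to_location_tokens(box: Tuple[float, float, float, float],
                           img_w: int, img_h: int, bins: int = 100) -> str:
    """Quantize an (x1, y1, x2, y2) bounding box into discrete location tokens."""
    x1, y1, x2, y2 = box
    # Normalize coordinates to [0, 1], then map each to an integer bin.
    coords = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    token_ids = [min(bins - 1, int(c * bins)) for c in coords]
    return " ".join(f"<loc{t}>" for t in token_ids)

def verbalize_detections(detections: List[dict], img_w: int, img_h: int) -> str:
    """Turn detector output (label + box) into a textual scene description."""
    parts = []
    for det in detections:
        loc = box_to_location_tokens(det["box"], img_w, img_h)
        parts.append(f"{det['label']} {loc}")
    return " . ".join(parts)

# Example with two hypothetical detections on a 480x360 image:
detections = [
    {"label": "cat", "box": (30.0, 120.0, 210.0, 300.0)},
    {"label": "sofa", "box": (0.0, 150.0, 480.0, 360.0)},
]
print(verbalize_detections(detections, img_w=480, img_h=360))
# -> cat <loc6> <loc33> <loc43> <loc83> . sofa <loc0> <loc41> <loc99> <loc99>
```

A text-only LM can then be trained on such verbalized scenes paired with the VSR statements, so that the spatial relation in the statement can be checked against the location tokens alone, without any visual input.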

Keywords: Deep learning; Language models; Spatial grounding.

MeSH terms

  • Language*
  • Learning*
  • Problem Solving