Visual Spatial Reasoning

Subjects: Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition
Version: v3 (2023-03-23)

Abstract

Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English (such as: under, in front of, and facing). While using a seemingly simple annotation format, we show how the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%. We observe that VLMs' by-relation performances have little correlation with the number of training examples and the tested models are in general incapable of recognising relations concerning the orientations of objects.
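
The by-relation analysis mentioned above can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes each example is a dictionary with `caption`, `relation`, and a binary `label` field (hypothetical field names chosen for illustration), and the toy examples, predictions, and training counts below are invented, not taken from VSR.

```python
from collections import defaultdict
from math import sqrt


def by_relation_accuracy(examples, predictions):
    """Group binary predictions by spatial relation and return accuracy per relation."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, prediction in zip(examples, predictions):
        relation = example["relation"]
        total[relation] += 1
        correct[relation] += int(prediction == example["label"])
    return {relation: correct[relation] / total[relation] for relation in total}


def pearson(xs, ys):
    """Pearson correlation coefficient; returns NaN if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


# Toy data, invented for illustration only (not drawn from the VSR dataset).
test_examples = [
    {"caption": "The cat is under the table.", "relation": "under", "label": 1},
    {"caption": "The dog is under the bed.", "relation": "under", "label": 0},
    {"caption": "The person is facing the horse.", "relation": "facing", "label": 1},
    {"caption": "The bus is in front of the car.", "relation": "in front of", "label": 1},
]
predictions = [1, 1, 0, 1]  # hypothetical model outputs
train_counts = {"under": 300, "facing": 200, "in front of": 100}  # hypothetical counts

acc = by_relation_accuracy(test_examples, predictions)
relations = sorted(acc)
r = pearson([train_counts[rel] for rel in relations], [acc[rel] for rel in relations])

print(acc)
print(f"correlation between training count and accuracy: {r:.2f}")
```

A correlation near zero on real data would correspond to the observation that per-relation performance has little relationship with the number of training examples; the value produced by the toy data above carries no such meaning.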

Cite

@article{arxiv.2205.00363,
  title   = {Visual Spatial Reasoning},
  author  = {Fangyu Liu and Guy Emerson and Nigel Collier},
  journal = {arXiv preprint arXiv:2205.00363},
  year    = {2023}
}

Comments

TACL camera-ready version; code and data available at https://github.com/cambridgeltl/visual-spatial-reasoning