Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning

Computer Vision and Pattern Recognition | 2025-10-07 | v1 | Artificial Intelligence | Machine Learning

Abstract

Vision-language models (VLMs) have advanced multimodal reasoning but still struggle with spatial reasoning over 3D scenes and complex object configurations. To address this, we introduce SpatialViLT, an enhanced VLM that integrates spatial features such as depth maps, 3D coordinates, and edge maps through a multi-task learning framework, enriching its multimodal embeddings with spatial understanding. We propose two variants, SpatialViLT and MaskedSpatialViLT, which focus on full and masked object regions, respectively. Additionally, SpatialEnsemble combines both approaches and achieves state-of-the-art accuracy. Our models excel in spatial reasoning categories such as directional, topological, and proximity relations, as demonstrated on the challenging Visual Spatial Reasoning (VSR) dataset. This work is a step toward improving the spatial intelligence of AI systems, which is crucial for advanced multimodal understanding and real-world applications.
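
The abstract does not spell out the training objective, so the following is only a minimal sketch of how a multi-task loss of this kind is commonly assembled: a main loss on the true/false caption label plus weighted auxiliary regression losses for depth, 3D coordinates, and edges. The function names, the use of mean squared error, the loss weights, and the probability-averaging rule for the ensemble are all illustrative assumptions, not the authors' implementation.

```python
import math


def binary_cross_entropy(p_true, label):
    """Main-task loss for one caption; p_true is the predicted P(caption is true)."""
    eps = 1e-12
    p = min(max(p_true, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def mean_squared_error(pred, target):
    """Auxiliary loss between a predicted and a reference spatial map (flattened)."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


def multitask_loss(p_true, label, aux_preds, aux_targets, aux_weights):
    """Main loss plus a weighted sum of auxiliary losses.

    aux_weights maps a hypothetical task name ("depth", "coords", "edges")
    to its weight; the weights here are assumptions, not values from the paper.
    """
    loss = binary_cross_entropy(p_true, label)
    for name, weight in aux_weights.items():
        loss += weight * mean_squared_error(aux_preds[name], aux_targets[name])
    return loss


def ensemble_probability(p_full, p_masked, w_full=0.5):
    """Assumed ensemble rule: weighted average of the two variants' probabilities."""
    return w_full * p_full + (1.0 - w_full) * p_masked
```

In this sketch the auxiliary terms act as regularizers: when the predicted spatial maps match their targets, the objective reduces to the main classification loss alone.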

Cite

@article{arxiv.2510.03441,
  title  = {Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning},
  author = {Chashi Mahiul Islam and Oteo Mamo and Samuel Jacob Chacko and Xiuwen Liu and Weikuan Yu},
  journal= {arXiv preprint arXiv:2510.03441},
  year   = {2025}
}

Comments

12 pages, 5 figures
