Related papers: Grounded Semantic Composition for Visual Scenes

Incorporating Visual Semantics into Sentence Representations within a Grounded Space

Language grounding is an active field aiming at enriching textual representations with visual information. Generally, textual and visual elements are embedded in the same representation space, which implicitly assumes a one-to-one…

Computation and Language · Computer Science 2020-02-10 Patrick Bordes , Eloi Zablocki , Laure Soulier , Benjamin Piwowarski , Patrick Gallinari

Learning Multi-Modal Word Representation Grounded in Visual Context

Representing the semantics of words is a long-standing problem for the natural language processing community. Most methods compute word semantics given their textual context in large corpora. More recently, researchers attempted to…

Computation and Language · Computer Science 2017-11-10 Éloi Zablocki , Benjamin Piwowarski , Laure Soulier , Patrick Gallinari

Language with Vision: a Study on Grounded Word and Sentence Embeddings

Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations by incorporating perceptual knowledge from vision into text-based representations. Despite many…

Computation and Language · Computer Science 2023-11-01 Hassan Shahmohammadi , Maria Heitmeier , Elnaz Shafaei-Bajestan , Hendrik P. A. Lensch , Harald Baayen

Towards Understanding Visual Grounding in Visual Language Models

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Georgios Pantazopoulos , Eda B. Özyiğit

Seeing the advantage: visually grounding word embeddings to better capture human semantic knowledge

Distributional semantic models capture word-level meaning that is useful in many natural language processing tasks and have even been shown to capture cognitive aspects of word meaning. The majority of these models are purely text based,…

Computation and Language · Computer Science 2022-03-31 Danny Merkx , Stefan L. Frank , Mirjam Ernestus

Semantic Composition in Visually Grounded Language Models

What is sentence meaning and its ideal representation? Much of the expressive power of human language derives from semantic composition, the mind's ability to represent meaning hierarchically & relationally over constituents. At the same…

Computation and Language · Computer Science 2023-05-29 Rohan Pandey

Neural Variational Learning for Grounded Language Acquisition

We propose a learning system in which language is grounded in visual percepts without specific pre-defined categories of terms. We present a unified generative method to acquire a shared semantic/visual embedding that enables the learning…

Computation and Language · Computer Science 2021-08-02 Nisha Pillai , Cynthia Matuszek , Francis Ferraro

Towards Visual Grounding: A Survey

Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Linhui Xiao , Xiaoshan Yang , Xiangyuan Lan , Yaowei Wang , Changsheng Xu

Learning Cross-modal Context Graph for Visual Grounding

Visual grounding is a ubiquitous building block in many vision-language tasks and yet remains challenging due to large variations in visual and linguistic features of grounding entities, strong context effect and the resulting semantic…

Computer Vision and Pattern Recognition · Computer Science 2019-11-26 Yongfei Liu , Bo Wan , Xiaodan Zhu , Xuming He

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed…

Computer Vision and Pattern Recognition · Computer Science 2024-03-07 Navid Rajabi , Jana Kosecka

Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes

We propose a model to learn visually grounded word embeddings (vis-w2v) to capture visual notions of semantic relatedness. While word embeddings trained using text have been extremely successful, they cannot uncover notions of semantic…

Computer Vision and Pattern Recognition · Computer Science 2016-06-30 Satwik Kottur , Ramakrishna Vedantam , José M. F. Moura , Devi Parikh

Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding

We address the problem of phrase grounding by lear ing a multi-level common semantic space shared by the textual and visual modalities. We exploit multiple levels of feature maps of a Deep Convolutional Neural Network, as well as…

Computer Vision and Pattern Recognition · Computer Science 2019-05-31 Hassan Akbari , Svebor Karaman , Surabhi Bhargava , Brian Chen , Carl Vondrick , Shih-Fu Chang

Learning Language Structures through Grounding

Language is highly structured, with syntactic and semantic structures, to some extent, agreed upon by speakers of the same language. With implicit or explicit awareness of such structures, humans can learn and use language efficiently and…

Computation and Language · Computer Science 2024-10-23 Freda Shi

Visually grounded models of spoken language: A survey of datasets, architectures and evaluation techniques

This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years. Such models are inspired by the observation that when children pick up a language, they rely on a wide range of…

Artificial Intelligence · Computer Science 2022-02-22 Grzegorz Chrupała

Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning

In natural language processing, most models try to learn semantic representations merely from texts. The learned representations encode the distributional semantics but fail to connect to any knowledge about the physical world. In contrast,…

Computation and Language · Computer Science 2021-11-16 Yizhen Zhang , Minkyu Choi , Kuan Han , Zhongming Liu

Grounding Spatio-Semantic Referring Expressions for Human-Robot Interaction

The human language is one of the most natural interfaces for humans to interact with robots. This paper presents a robot system that retrieves everyday objects with unconstrained natural language descriptions. A core issue for the system is…

Robotics · Computer Science 2017-07-19 Mohit Shridhar , David Hsu

Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models

Spatial Reasoning is an important component of human cognition and is an area in which the latest Vision-language models (VLMs) show signs of difficulty. The current analysis works use image captioning tasks and visual question answering.…

Computation and Language · Computer Science 2025-11-11 Akshar Tumu , Varad Shinde , Parisa Kordjamshidi

Probing Contextual Language Models for Common Ground with Visual Representations

The success of large-scale contextual language models has attracted great interest in probing what is encoded in their representations. In this work, we consider a new question: to what extent contextual representations of concrete nouns…

Computation and Language · Computer Science 2021-04-14 Gabriel Ilharco , Rowan Zellers , Ali Farhadi , Hannaneh Hajishirzi

Representations of language in a model of visually grounded speech signal

We present a visually grounded model of speech perception which projects spoken utterances and images to a joint semantic space. We use a multi-layer recurrent highway network to model the temporal nature of spoken speech, and show that it…

Computation and Language · Computer Science 2018-10-30 Grzegorz Chrupała , Lieke Gelderloos , Afra Alishahi

Discovering Meaningful Units with Visually Grounded Semantics from Image Captions

Fine-grained knowledge is crucial for vision-language models to obtain a better understanding of the real world. While there has been work trying to acquire this kind of knowledge in the space of vision and language, it has mostly focused…

Computer Vision and Pattern Recognition · Computer Science 2025-11-17 Melika Behjati , James Henderson