Related papers: CompGuessWhat?!: A Multi-task Evaluation Framework…

ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

Vision language action (VLA) models enable generalist robotic agents but often exhibit language ignorance, relying on visual shortcuts and remaining insensitive to instruction changes. We present Prospective Grounding and Alignment VLA…

Robotics · Computer Science · 2026-04-14 · Nastaran Darabi, Amit Ranjan Trivedi

IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English…

Computation and Language · Computer Science · 2022-07-19 · Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer, Siva Reddy, Desmond Elliott, Edoardo Maria Ponti, Ivan Vulić

Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation

Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or single sub-instruction to the corresponding trajectory. However, another critical…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-25 · Yibo Cui, Liang Xie, Yakun Zhang, Meishan Zhang, Ye Yan, Erwei Yin

Imagining Grounded Conceptual Representations from Perceptual Information in Situated Guessing Games

In visual guessing games, a Guesser has to identify a target object in a scene by asking questions to an Oracle. An effective strategy for the players is to learn conceptual representations of objects that are both discriminative and…

Computation and Language · Computer Science · 2020-11-06 · Alessandro Suglia, Antonio Vergari, Ioannis Konstas, Yonatan Bisk, Emanuele Bastianelli, Andrea Vanzo, Oliver Lemon

Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)

One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features,…

Computation and Language · Computer Science · 2020-10-13 · Alex Warstadt, Yian Zhang, Haau-Sing Li, Haokun Liu, Samuel R. Bowman

Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat

We propose a grounded dialogue state encoder which addresses a foundational issue on how to integrate visual grounding with dialogue system components. As a test-bed, we focus on the GuessWhat?! game, a two-player game where the goal is to…

Computation and Language · Computer Science · 2019-03-18 · Ravi Shekhar, Aashish Venkatesh, Tim Baumgärtner, Elia Bruni, Barbara Plank, Raffaella Bernardi, Raquel Fernández

GRILL: Grounded Vision-language Pre-training via Aligning Text and Image Regions

Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks. However, such generalization to vision-language tasks including grounding and generation tasks has…

Computation and Language · Computer Science · 2023-05-25 · Woojeong Jin, Subhabrata Mukherjee, Yu Cheng, Yelong Shen, Weizhu Chen, Ahmed Hassan Awadallah, Damien Jose, Xiang Ren

Learning Better Visual Dialog Agents with Pretrained Visual-Linguistic Representation

GuessWhat?! is a two-player visual dialog guessing game where player A asks a sequence of yes/no questions (Questioner) and makes a final guess (Guesser) about a target object in an image, based on answers from player B (Oracle). Based on…

Computer Vision and Pattern Recognition · Computer Science · 2021-05-26 · Tao Tu, Qing Ping, Govind Thattai, Gokhan Tur, Prem Natarajan

Describe me an Aucklet: Generating Grounded Perceptual Category Descriptions

Human speakers can generate descriptions of perceptual concepts, abstracted from the instance-level. Moreover, such descriptions can be used by other speakers to learn provisional representations of those concepts. Learning and using…

Computation and Language · Computer Science · 2023-10-27 · Bill Noble, Nikolai Ilinykh

Goal-Oriented Gaze Estimation for Zero-Shot Learning

Zero-shot learning (ZSL) aims to recognize novel classes by transferring semantic knowledge from seen classes to unseen classes. Since semantic knowledge is built on attributes shared between different classes, which are highly local,…

Computer Vision and Pattern Recognition · Computer Science · 2021-03-08 · Yang Liu, Lei Zhou, Xiao Bai, Yifei Huang, Lin Gu, Jun Zhou, Tatsuya Harada

GELDA: A generative language annotation framework to reveal visual biases in datasets

Bias analysis is a crucial step in the process of creating fair datasets for training and evaluating computer vision models. The bottleneck in dataset analysis is annotation, which typically requires: (1) specifying a list of attributes…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-01 · Krish Kabra, Kathleen M. Lewis, Guha Balakrishnan

VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models

Vision-language-action (VLA) models are emerging as embodied foundation models for robotic manipulation, but their deployment introduces a new unlearning challenge: removing unsafe, spurious, or privacy-sensitive behaviors without degrading…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-24 · Ravi Ranjan, Agoritsa Polyzou

Grounded Language-Image Pre-training

This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The…

Computer Vision and Pattern Recognition · Computer Science · 2022-06-20 · Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao

DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning

Few-shot learning (FSL) aims to generalize to novel categories with only a few samples. Recent approaches incorporate large language models (LLMs) to enrich visual representations with semantic embeddings derived from class names. However,…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-25 · Wenhao Li, Xianjing Meng, Qiangchang Wang, Zhongyi Han, Zhibin Wu, Yilong Yin

Stacked Semantic-Guided Attention Model for Fine-Grained Zero-Shot Learning

Zero-Shot Learning (ZSL) is achieved via aligning the semantic relationships between the global image feature vector and the corresponding class semantic descriptions. However, using the global features to represent fine-grained images may…

Computer Vision and Pattern Recognition · Computer Science · 2018-05-22 · Yunlong Yu, Zhong Ji, Yanwei Fu, Jichang Guo, Yanwei Pang, Zhongfei Zhang

DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning

Zero-shot learning (ZSL) aims to predict unseen classes whose samples have never appeared during training. One of the most effective and widely used semantic information for zero-shot image classification are attributes which are…

Computer Vision and Pattern Recognition · Computer Science · 2023-02-17 · Zhuo Chen, Yufeng Huang, Jiaoyan Chen, Yuxia Geng, Wen Zhang, Yin Fang, Jeff Z. Pan, Huajun Chen

Grounded Object Centric Learning

The extraction of modular object-centric representations for downstream tasks is an emerging area of research. Learning grounded representations of objects that are guaranteed to be stable and invariant promises robust performance across…

Machine Learning · Computer Science · 2024-01-26 · Avinash Kori, Francesco Locatello, Fabio De Sousa Ribeiro, Francesca Toni, Ben Glocker

An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition

Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training. Previous research has focused on aligning sequences' visual and semantic spatial…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-04 · Haojun Xu, Yan Gao, Jie Li, Xinbo Gao

Visually Grounded Language Learning: a review of language games, datasets, tasks, and models

In recent years, several machine learning models have been proposed. They are trained with a language modelling objective on large-scale text-only data. With such pretraining, they can achieve impressive results on many Natural Language…

Computation and Language · Computer Science · 2023-12-06 · Alessandro Suglia, Ioannis Konstas, Oliver Lemon

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of complex visual scene and textual context, and (ii) a…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-10 · Zeyu Han, Fangrui Zhu, Qianru Lao, Huaizu Jiang