Related papers: Multimodal Reference Visual Grounding

RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data

In this paper, we introduce the task of visual grounding for remote sensing data (RSVG). RSVG aims to localize the referred objects in remote sensing (RS) images with the guidance of natural language. To retrieve rich information from RS…

Computer Vision and Pattern Recognition · Computer Science 2023-05-03 Yang Zhan, Zhitong Xiong, Yuan Yuan

ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding

Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Minghang Zheng, Jiahua Zhang, Qingchao Chen, Yuxin Peng, Yang Liu

GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods have begun to address multi-image grounding. However, they are constrained by…

Computer Vision and Pattern Recognition · Computer Science 2026-01-09 Shurong Zheng, Yousong Zhu, Hongyin Zhao, Fan Yang, Yufei Zhan, Ming Tang, Jinqiao Wang

Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding

Visual grounding seeks to localize the image region corresponding to a free-form text description. Recently, the strong multimodal capabilities of Large Vision-Language Models (LVLMs) have driven substantial improvements in visual…

Computer Vision and Pattern Recognition · Computer Science 2025-03-11 Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang

AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations

Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG…

Computer Vision and Pattern Recognition · Computer Science 2025-10-09 Junli Liu, Qizhi Chen, Zhigang Wang, Yiwen Tang, Yiting Zhang, Chi Yan, Dong Wang, Xuelong Li, Bin Zhao

ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery

Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit…

Computer Vision and Pattern Recognition · Computer Science 2026-04-03 Ke Li, Ting Wang, Di Wang, Yongshan Zhu, Yiming Zhang, Tao Lei, Quan Wang

VGR: Visual Grounded Reasoning

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly rely on reasoning in pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This…

Computer Vision and Pattern Recognition · Computer Science 2026-05-04 Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao

GroundingME: Exposing the Visual Grounding Gap in MLLMs through Multi-Dimensional Evaluation

Visual grounding, localizing objects from natural language descriptions, represents a critical bridge between language and vision understanding. While multimodal large language models (MLLMs) achieve impressive scores on existing…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Rang Li, Lei Li, Shuhuai Ren, Hao Tian, Shuhao Gu, Shicheng Li, Zihao Yue, Yudong Wang, Wenhan Ma, Zhe Yang, Jingyuan Ma, Zhifang Sui, Fuli Luo

Towards Understanding Visual Grounding in Visual Language Models

Visual grounding refers to the ability of a model to identify a region within some visual input that matches a textual description. Consequently, a model equipped with visual grounding capabilities can target a wide range of applications in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Georgios Pantazopoulos, Eda B. Özyiğit

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Ming Dai, Lingfeng Yang, Yihao Xu, Zhenhua Feng, Wankou Yang

ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model

Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing…

Computer Vision and Pattern Recognition · Computer Science 2025-08-21 Weitai Kang, Weiming Zhuang, Zhizhong Li, Yan Yan, Lingjuan Lyu

Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities…

Computer Vision and Pattern Recognition · Computer Science 2023-07-24 Zhihong Chen, Ruifei Zhang, Yibing Song, Xiang Wan, Guanbin Li

UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning

Traditional visual grounding methods primarily focus on single-image scenarios with simple textual references. However, extending these methods to real-world scenarios that involve implicit and complex instructions, particularly in…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, Yansong Tang

RealVLG-R1: A Large-Scale Real-World Visual-Language Grounding Benchmark for Robotic Perception and Manipulation

Visual-language grounding aims to establish semantic correspondences between natural language and visual entities, enabling models to accurately identify and localize target objects based on textual instructions. Existing VLG approaches…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Linfei Li, Lin Zhang, Ying Shen

RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning

Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. Owing to the vast spatial scale and high semantic ambiguity of remote sensing scenes, these…

Computer Vision and Pattern Recognition · Computer Science 2026-01-30 Shiqi Huang, Shuting He, Bihan Wen

Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models

Large vision-and-language models (VLMs) trained to match images with text on large-scale datasets of image-text pairs have shown impressive generalization ability on several vision and language tasks. Several recent works, however, showed…

Computer Vision and Pattern Recognition · Computer Science 2024-03-07 Navid Rajabi, Jana Kosecka

ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding

3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using…

Computer Vision and Pattern Recognition · Computer Science 2025-07-09 Austin T. Wang, ZeMing Gong, Angel X. Chang

AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

Visual grounding, the task of localizing objects described by natural-language expressions, is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted…

Computer Vision and Pattern Recognition · Computer Science 2026-05-22 Haocheng Li, Juepeng Zheng, Zenghao Yang, Kaiqi Du, Guilong Xiao, Gengmeng Pu, Haohuan Fu, Jianxi Huang

LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

Visual grounding is an essential tool that links user-provided text queries with query-specific regions within an image. Despite advancements in visual grounding models, their ability to comprehend complex queries remains limited. To…

Computer Vision and Pattern Recognition · Computer Science 2024-05-29 Haoyu Zhao, Wenhang Ge, Ying-cong Chen

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

While multimodal large language models (MLLMs) have demonstrated extraordinary vision-language understanding capabilities, their abilities to solve instance-level visual-language problems beyond a single image warrant further exploration.…

Computer Vision and Pattern Recognition · Computer Science 2025-07-23 Yunqiu Xu, Linchao Zhu, Yi Yang