Related papers: Language-Guided Diffusion Model for Visual Groundi…

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Ming Dai, Lingfeng Yang, Yihao Xu, Zhenhua Feng, Wankou Yang

Towards GUI Agents: Vision-Language Diffusion Models for GUI Grounding

Autoregressive (AR) vision-language models (VLMs) have long dominated multimodal understanding, reasoning, and graphical user interface (GUI) grounding. Recently, discrete diffusion vision-language models (DVLMs) have shown strong…

Computer Vision and Pattern Recognition · Computer Science 2026-03-30 Shrinidhi Kumbhar, Haofu Liao, Srikar Appalaraju, Kunwar Yashraj Singh

Exploring Iterative Refinement with Diffusion Models for Video Grounding

Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query. Existing methods typically select the best prediction from a set of predefined proposals or directly regress the target span…

Computer Vision and Pattern Recognition · Computer Science 2024-01-01 Xiao Liang, Tao Shi, Yaoyuan Liang, Te Tao, Shao-Lun Huang

TransVG: End-to-End Visual Grounding with Transformers

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region onto an image. The state-of-the-art methods,…

Computer Vision and Pattern Recognition · Computer Science 2022-01-17 Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. The previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually-designed…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang

ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery

Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit…

Computer Vision and Pattern Recognition · Computer Science 2026-04-03 Ke Li, Ting Wang, Di Wang, Yongshan Zhu, Yiming Zhang, Tao Lei, Quan Wang

VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive…

Computer Vision and Pattern Recognition · Computer Science 2024-01-24 Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang

Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities…

Computer Vision and Pattern Recognition · Computer Science 2023-07-24 Zhihong Chen, Ruifei Zhang, Yibing Song, Xiang Wan, Guanbin Li

Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

Diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive (AR) LLMs. Recently, this paradigm has been extended to multimodal tasks, leading to the development of diffusion multimodal large language…

Artificial Intelligence · Computer Science 2026-04-08 Keuntae Kim, Mingyu Kang, Yong Suk Choi

Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning

Visual grounding is a task to locate the target indicated by a natural language expression. Existing methods extend the generic object detection framework to this problem. They base the visual grounding on the features from pre-generated…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu

Diffusion Model as a Generalist Segmentation Learner

Diffusion models are primarily trained for image synthesis, yet their denoising trajectories encode rich, spatially aligned visual priors. In this paper, we demonstrate that these priors can be utilized for text-conditioned semantic and…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Haoxiao Wang, Antao Xiang, Haiyang Sun, Peilin Sun, Changhao Pan, Yifu Chen, Minjie Hong, Weijie Wang, Shuang Chen, Yue Chen, Zhou Zhao

ViGiL3D: A Linguistically Diverse Dataset for 3D Visual Grounding

3D visual grounding (3DVG) involves localizing entities in a 3D scene referred to by natural language text. Such models are useful for embodied AI and scene retrieval applications, which involve searching for objects or patterns using…

Computer Vision and Pattern Recognition · Computer Science 2025-07-09 Austin T. Wang, ZeMing Gong, Angel X. Chang

Image Difference Grounding with Natural Language

Visual grounding (VG) typically focuses on locating regions of interest within an image using natural language, and most existing VG methods are limited to single-image interpretations. This limits their applicability in real-world…

Computer Vision and Pattern Recognition · Computer Science 2025-04-03 Wenxuan Wang, Zijia Zhao, Yisi Zhang, Yepeng Tang, Erdong Hu, Xinlong Wang, Jing Liu

Bridging Modality Gap for Visual Grounding with Effective Cross-modal Distillation

Visual grounding aims to align visual information of specific regions of images with corresponding natural language expressions. Current visual grounding methods leverage pre-trained visual and language backbones independently to obtain…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Jiaxi Wang, Wenhui Hu, Xueyang Liu, Beihu Wu, Yuting Qiu, YingYing Cai

Textual and Visual Prompt Fusion for Image Editing via Step-Wise Alignment

The use of denoising diffusion models is becoming increasingly popular in the field of image editing. However, current approaches often rely on either image-guided methods, which provide a visual reference but lack control over semantic…

Computer Vision and Pattern Recognition · Computer Science 2025-01-07 Zhanbo Feng, Zenan Ling, Xinyu Lu, Ci Gong, Feng Zhou, Wugedele Bao, Jie Li, Fan Yang, Robert C. Qiu

OV-VG: A Benchmark for Open-Vocabulary Visual Grounding

Open-vocabulary learning has emerged as a cutting-edge research area, particularly in light of the widespread adoption of vision-based foundational models. Its primary objective is to comprehend novel concepts that are not encompassed…

Computer Vision and Pattern Recognition · Computer Science 2023-10-24 Chunlei Wang, Wenquan Feng, Xiangtai Li, Guangliang Cheng, Shuchang Lyu, Binghao Liu, Lijiang Chen, Qi Zhao

VGent: Visual Grounding via Modular Design for Disentangling Reasoning and Prediction

Current visual grounding models are either based on a Multimodal Large Language Model (MLLM) that performs auto-regressive decoding, which is slow and risks hallucinations, or on re-aligning an LLM with vision features to learn new special…

Computer Vision and Pattern Recognition · Computer Science 2025-12-15 Weitai Kang, Jason Kuen, Mengwei Ren, Zijun Wei, Yan Yan, Kangning Liu

ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding

Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Minghang Zheng, Jiahua Zhang, Qingchao Chen, Yuxin Peng, Yang Liu

You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding

Visual Grounding (VG) aims to locate the most relevant region in an image, based on a flexible natural language query but not a pre-defined label, thus it can be a more useful technique than object detection in practice. Most…

Computer Vision and Pattern Recognition · Computer Science 2019-03-19 Chaorui Deng, Qi Wu, Guanghui Xu, Zhuliang Yu, Yanwu Xu, Kui Jia, Mingkui Tan

Unleashing Text-to-Image Diffusion Models for Visual Perception

Diffusion models (DMs) have become the new trend of generative models and have demonstrated a powerful ability of conditional synthesis. Among those, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly…

Computer Vision and Pattern Recognition · Computer Science 2023-03-06 Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu