Related papers: Progressive Language-guided Visual Learning for Mu…

Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models

Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular…

Computer Vision and Pattern Recognition · Computer Science 2024-12-20 Quang-Hung Le, Long Hoang Dang, Ngan Le, Truyen Tran, Thao Minh Le

ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery

Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit…

Computer Vision and Pattern Recognition · Computer Science 2026-04-03 Ke Li, Ting Wang, Di Wang, Yongshan Zhu, Yiming Zhang, Tao Lei, Quan Wang

PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

Vision-language pre-training (VLP) has shown impressive performance on a wide range of cross-modal tasks, where VLP models without reliance on object detectors are becoming the mainstream due to their superior computation efficiency and…

Computer Vision and Pattern Recognition · Computer Science 2022-11-23 Yuan Yao, Qianyu Chen, Ao Zhang, Wei Ji, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

Referring Transformer: A One-step Approach to Multi-task Visual Grounding

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or…

Computer Vision and Pattern Recognition · Computer Science 2021-07-15 Muchen Li, Leonid Sigal

Language Features Matter: Effective Language Representations for Vision-Language Tasks

Shouldn't language and vision features be treated equally in vision-language (VL) tasks? Many VL approaches treat the language component as an afterthought, using simple language models that are either built upon fixed word embeddings…

Computer Vision and Pattern Recognition · Computer Science 2019-08-20 Andrea Burns, Reuben Tan, Kate Saenko, Stan Sclaroff, Bryan A. Plummer

Multi-task Visual Grounding with Coarse-to-Fine Consistency Constraints

Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to…

Computer Vision and Pattern Recognition · Computer Science 2025-01-14 Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang

Co-Training Vision Language Models for Remote Sensing Multi-task Learning

With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to…

Computer Vision and Pattern Recognition · Computer Science 2026-01-12 Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen, Xue Yang

Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

Pre-trained language models (PLMs) have played an increasing role in multimedia research. In terms of vision-language (VL) tasks, they often serve as a language encoder and still require an additional fusion network for VL reasoning,…

Computer Vision and Pattern Recognition · Computer Science 2023-08-23 Shubin Huang, Qiong Wu, Yiyi Zhou, Weijie Chen, Rongsheng Zhang, Xiaoshuai Sun, Rongrong Ji

Latent Expression Generation for Referring Image Segmentation and Grounding

Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Seonghoon Yu, Junbeom Hong, Joonseok Lee, Jeany Son

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like objectives (masked language modeling and image-text matching) during pretraining. Although they perform well in many understanding downstream…

Computer Vision and Pattern Recognition · Computer Science 2021-12-16 Tianyi Liu, Zuxuan Wu, Wenhan Xiong, Jingjing Chen, Yu-Gang Jiang

ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding

Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Minghang Zheng, Jiahua Zhang, Qingchao Chen, Yuxin Peng, Yang Liu

ProgVLA: Progress-Aware Robot Manipulation Skill Learning

We present ProgVLA, a compact vision-language-action (VLA) model designed for reliable robot manipulation under tight compute and memory budgets. The model specifically focuses on efficiently processing long multi-modal sequences by…

Robotics · Computer Science 2026-05-28 Seungsu Kim, Jinyoung Choi, Seungmin Baek, Jean-Michel Renders

A Parameter-Efficient Tuning Framework for Language-guided Object Grounding and Robot Grasping

The language-guided robot grasping task requires a robot agent to integrate multimodal information from both visual and linguistic inputs to predict actions for target-driven grasping. While recent approaches utilizing Multimodal Large…

Robotics · Computer Science 2025-02-10 Houjian Yu, Mingen Li, Alireza Rezazadeh, Yang Yang, Changhyun Choi

Enhancing Visual Grounding and Generalization: A Multi-Task Cycle Training Approach for Vision-Language Models

Visual grounding (VG) occupies a pivotal position in multi-modality vision-language models. In this study, we propose ViLaM, a large multi-modality model, that supports multi-tasks of VG using the cycle training strategy, with abundant…

Computer Vision and Pattern Recognition · Computer Science 2024-04-29 Xiaoyu Yang, Lijian Xu, Hao Sun, Hongsheng Li, Shaoting Zhang

Multitask Vision-Language Prompt Tuning

Prompt Tuning, conditioning on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2022-12-06 Sheng Shen, Shijia Yang, Tianjun Zhang, Bohan Zhai, Joseph E. Gonzalez, Kurt Keutzer, Trevor Darrell

Towards Visual Grounding: A Survey

Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, Changsheng Xu

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. The previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually-designed…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang

Advancing Visual Large Language Model for Multi-granular Versatile Perception

Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two dimensions: prediction type and instruction type.…

Computer Vision and Pattern Recognition · Computer Science 2025-07-23 Wentao Xiang, Haoxian Tan, Cong Wei, Yujie Zhong, Dengjie Li, Yujiu Yang

PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation

3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Wenbin Tan, Jiawen Lin, Fangyong Wang, Yuan Xie, Yong Xie, Yachao Zhang, Yanyun Qu

GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing

The application of Vision-Language Models (VLMs) in remote sensing (RS) has demonstrated significant potential in traditional tasks such as scene classification, object detection, and image captioning. However, current models, which excel…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Zilun Zhang, Haozhan Shen, Tiancheng Zhao, Bin Chen, Zian Guan, Yuhao Wang, Xu Jia, Yuxiang Cai, Yongheng Shang, Jianwei Yin