Related papers: Progressive Language-guided Visual Learning for Mu…

200 papers

Existing Large Vision-Language Models (LVLMs) excel at matching concepts across multi-modal inputs but struggle with compositional concepts and high-level relationships between entities. This paper introduces Progressive multi-granular…

Computer Vision and Pattern Recognition · Computer Science 2024-12-20 Quang-Hung Le, Long Hoang Dang, Ngan Le, Truyen Tran, Thao Minh Le

Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit…

Computer Vision and Pattern Recognition · Computer Science 2026-04-03 Ke Li, Ting Wang, Di Wang, Yongshan Zhu, Yiming Zhang, Tao Lei, Quan Wang

Vision-language pre-training (VLP) has shown impressive performance on a wide range of cross-modal tasks, where VLP models without reliance on object detectors are becoming the mainstream due to their superior computation efficiency and…

Computer Vision and Pattern Recognition · Computer Science 2022-11-23 Yuan Yao, Qianyu Chen, Ao Zhang, Wei Ji, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or…

Computer Vision and Pattern Recognition · Computer Science 2021-07-15 Muchen Li, Leonid Sigal

Shouldn't language and vision features be treated equally in vision-language (VL) tasks? Many VL approaches treat the language component as an afterthought, using simple language models that are either built upon fixed word embeddings…

Computer Vision and Pattern Recognition · Computer Science 2019-08-20 Andrea Burns, Reuben Tan, Kate Saenko, Stan Sclaroff, Bryan A. Plummer

Multi-task visual grounding involves the simultaneous execution of localization and segmentation in images based on textual expressions. The majority of advanced methods predominantly focus on transformer-based multimodal fusion, aiming to…

Computer Vision and Pattern Recognition · Computer Science 2025-01-14 Ming Dai, Jian Li, Jiedong Zhuang, Xian Zhang, Wankou Yang

With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to…

Computer Vision and Pattern Recognition · Computer Science 2026-01-12 Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen, Xue Yang

Pre-trained language models (PLMs) have played an increasing role in multimedia research. In terms of vision-language (VL) tasks, they often serve as a language encoder and still require an additional fusion network for VL reasoning,…

Computer Vision and Pattern Recognition · Computer Science 2023-08-23 Shubin Huang, Qiong Wu, Yiyi Zhou, Weijie Chen, Rongsheng Zhang, Xiaoshuai Sun, Rongrong Ji

Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Seonghoon Yu, Junbeom Hong, Joonseok Lee, Jeany Son

Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like objectives (masked language modeling and image-text matching) during pretraining. Although they perform well in many understanding downstream…

Computer Vision and Pattern Recognition · Computer Science 2021-12-16 Tianyi Liu, Zuxuan Wu, Wenhan Xiong, Jingjing Chen, Yu-Gang Jiang

Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Minghang Zheng, Jiahua Zhang, Qingchao Chen, Yuxin Peng, Yang Liu

We present ProgVLA, a compact vision-language-action (VLA) model designed for reliable robot manipulation under tight compute and memory budgets. The model specifically focuses on efficiently processing long multi-modal sequences by…

Robotics · Computer Science 2026-05-28 Seungsu Kim, Jinyoung Choi, Seungmin Baek, Jean-Michel Renders

The language-guided robot grasping task requires a robot agent to integrate multimodal information from both visual and linguistic inputs to predict actions for target-driven grasping. While recent approaches utilizing Multimodal Large…

Robotics · Computer Science 2025-02-10 Houjian Yu, Mingen Li, Alireza Rezazadeh, Yang Yang, Changhyun Choi

Visual grounding (VG) occupies a pivotal position in multi-modality vision-language models. In this study, we propose ViLaM, a large multi-modality model, that supports multi-tasks of VG using the cycle training strategy, with abundant…

Computer Vision and Pattern Recognition · Computer Science 2024-04-29 Xiaoyu Yang, Lijian Xu, Hao Sun, Hongsheng Li, Shaoting Zhang

Prompt Tuning, conditioning on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2022-12-06 Sheng Shen, Shijia Yang, Tianjun Zhang, Bohan Zhai, Joseph E. Gonzalez, Kurt Keutzer, Trevor Darrell

Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, Changsheng Xu

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. The previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually-designed…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang

Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two dimensions: prediction type and instruction type.…

Computer Vision and Pattern Recognition · Computer Science 2025-07-23 Wentao Xiang, Haoxian Tan, Cong Wei, Yujie Zhong, Dengjie Li, Yujiu Yang

3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Wenbin Tan, Jiawen Lin, Fangyong Wang, Yuan Xie, Yong Xie, Yachao Zhang, Yanyun Qu

The application of Vision-Language Models (VLMs) in remote sensing (RS) has demonstrated significant potential in traditional tasks such as scene classification, object detection, and image captioning. However, current models, which excel…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Zilun Zhang, Haozhan Shen, Tiancheng Zhao, Bin Chen, Zian Guan, Yuhao Wang, Xu Jia, Yuxiang Cai, Yongheng Shang, Jianwei Yin