Related papers: An Efficient and Effective Transformer Decoder-Bas…

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. The previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually-designed…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou, Yanyong Zhang, Houqiang Li, Wanli Ouyang

TransVG: End-to-End Visual Grounding with Transformers

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region onto an image. The state-of-the-art methods,…

Computer Vision and Pattern Recognition · Computer Science 2022-01-17 Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, Houqiang Li

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Ming Dai, Lingfeng Yang, Yihao Xu, Zhenhua Feng, Wankou Yang

Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning

Visual grounding is a task to locate the target indicated by a natural language expression. Existing methods extend the generic object detection framework to this problem. They base the visual grounding on the features from pre-generated…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu

Visual Grounding with Transformers

In this paper, we propose a transformer based approach for visual grounding. Unlike previous proposal-and-rank frameworks that rely heavily on pretrained object detectors or proposal-free frameworks that upgrade an off-the-shelf one-stage…

Computer Vision and Pattern Recognition · Computer Science 2022-03-15 Ye Du, Zehua Fu, Qingjie Liu, Yunhong Wang

Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

Multimodal transformer exhibits high capacity and flexibility to align image and text for visual grounding. However, the existing encoder-only grounding framework (e.g., TransVG) suffers from heavy computation due to the self-attention…

Computer Vision and Pattern Recognition · Computer Science 2023-10-27 Fengyuan Shi, Ruopeng Gao, Weilin Huang, Limin Wang

Referring Transformer: A One-step Approach to Multi-task Visual Grounding

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or…

Computer Vision and Pattern Recognition · Computer Science 2021-07-15 Muchen Li, Leonid Sigal

Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on the implicit exploration of multimodal co-reference by implicitly…

Computation and Language · Computer Science 2021-09-20 Feilong Chen, Fandong Meng, Xiuyi Chen, Peng Li, Jie Zhou

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring…

Computer Vision and Pattern Recognition · Computer Science 2022-04-07 Zhao Yang, Jiaqi Wang, Yansong Tang, Kai Chen, Hengshuang Zhao, Philip H. S. Torr

Image-Specific Adaptation of Transformer Encoders for Compute-Efficient Segmentation

Vision transformer based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be…

Computer Vision and Pattern Recognition · Computer Science 2026-04-01 Manyi Yao, Abhishek Aich, Yumin Suh, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker

FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Zipeng Wang, Dan Xu

Visual Grounding with Attention-Driven Constraint Balancing

Unlike Object Detection, Visual Grounding task necessitates the detection of an object described by complex free-form language. To simultaneously model such complex semantic and visual representations, recent state-of-the-art studies adopt…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Weitai Kang, Luowei Zhou, Junyi Wu, Changchang Sun, Yan Yan

LSVG: Language-Guided Scene Graphs with 2D-Assisted Multi-Modal Encoding for 3D Visual Grounding

3D visual grounding aims to localize the unique target described by natural languages in 3D scenes. The significant gap between 3D and language modalities makes it a notable challenge to distinguish multiple similar objects through the…

Computer Vision and Pattern Recognition · Computer Science 2025-08-18 Feng Xiao, Hongbin Xu, Guocan Zhao, Wenxiong Kang

SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

Different from Object Detection, Visual Grounding deals with detecting a bounding box for each text-image pair. This one box for each text-image data provides sparse supervision signals. Although previous works achieve impressive results,…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Weitai Kang, Gaowen Liu, Mubarak Shah, Yan Yan

GPT-4 Enhanced Multimodal Grounding for Autonomous Driving: Leveraging Cross-Modal Attention with Large Language Models

In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework,…

Computer Vision and Pattern Recognition · Computer Science 2026-01-19 Haicheng Liao, Huanming Shen, Zhenning Li, Chengyue Wang, Guofa Li, Yiming Bie, Chengzhong Xu

Dynamic Grained Encoder for Vision Transformers

Transformers, the de-facto standard for language modeling, have been recently applied for vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save…

Computer Vision and Pattern Recognition · Computer Science 2023-01-11 Lin Song, Songyang Zhang, Songtao Liu, Zeming Li, Xuming He, Hongbin Sun, Jian Sun, Nanning Zheng

Bridging Modality Gap for Visual Grounding with Effecitve Cross-modal Distillation

Visual grounding aims to align visual information of specific regions of images with corresponding natural language expressions. Current visual grounding methods leverage pre-trained visual and language backbones independently to obtain…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Jiaxi Wang, Wenhui Hu, Xueyang Liu, Beihu Wu, Yuting Qiu, YingYing Cai

Multi-View Transformer for 3D Visual Grounding

The 3D visual grounding task aims to ground a natural language description to the targeted object in a 3D scene, which is usually represented in 3D point clouds. Previous works studied visual grounding under specific views. The…

Computer Vision and Pattern Recognition · Computer Science 2022-04-06 Shijia Huang, Yilun Chen, Jiaya Jia, Liwei Wang

EfficienT-HDR: An Efficient Transformer-Based Framework via Multi-Exposure Fusion for HDR Reconstruction

Achieving high-quality High Dynamic Range (HDR) imaging on resource-constrained edge devices is a critical challenge in computer vision, as its performance directly impacts downstream tasks such as intelligent surveillance and autonomous…

Computer Vision and Pattern Recognition · Computer Science 2025-09-25 Yu-Shen Huang, Tzu-Han Chen, Cheng-Yen Hsiao, Shaou-Gang Miaou

A Decoding Scheme with Successive Aggregation of Multi-Level Features for Light-Weight Semantic Segmentation

Multi-scale architecture, including hierarchical vision transformer, has been commonly applied to high-resolution semantic segmentation to deal with computational complexity with minimum performance loss. In this paper, we propose a novel…

Computer Vision and Pattern Recognition · Computer Science 2024-06-17 Jiwon Yoo, Jangwon Lee, Gyeonghwan Kim