Related papers: An Efficient and Effective Transformer Decoder-Bas…

200 papers

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. The previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually-designed…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Jiajun Deng , Zhengyuan Yang , Daqing Liu , Tianlang Chen , Wengang Zhou , Yanyong Zhang , Houqiang Li , Wanli Ouyang

In this paper, we present a neat yet effective transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to the corresponding region onto an image. The state-of-the-art methods,…

Computer Vision and Pattern Recognition · Computer Science 2022-01-17 Jiajun Deng , Zhengyuan Yang , Tianlang Chen , Wengang Zhou , Houqiang Li

Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Ming Dai , Lingfeng Yang , Yihao Xu , Zhenhua Feng , Wankou Yang

Visual grounding is a task to locate the target indicated by a natural language expression. Existing methods extend the generic object detection framework to this problem. They base the visual grounding on the features from pre-generated…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Li Yang , Yan Xu , Chunfeng Yuan , Wei Liu , Bing Li , Weiming Hu

In this paper, we propose a transformer based approach for visual grounding. Unlike previous proposal-and-rank frameworks that rely heavily on pretrained object detectors or proposal-free frameworks that upgrade an off-the-shelf one-stage…

Computer Vision and Pattern Recognition · Computer Science 2022-03-15 Ye Du , Zehua Fu , Qingjie Liu , Yunhong Wang

Multimodal transformer exhibits high capacity and flexibility to align image and text for visual grounding. However, the existing encoder-only grounding framework (e.g., TransVG) suffers from heavy computation due to the self-attention…

Computer Vision and Pattern Recognition · Computer Science 2023-10-27 Fengyuan Shi , Ruopeng Gao , Weilin Huang , Limin Wang

As an important step towards visual reasoning, visual grounding (e.g., phrase localization, referring expression comprehension/segmentation) has been widely explored. Previous approaches to referring expression comprehension (REC) or…

Computer Vision and Pattern Recognition · Computer Science 2021-07-15 Muchen Li , Leonid Sigal

Visual dialogue is a challenging task since it needs to answer a series of coherent questions on the basis of understanding the visual environment. Previous studies focus on the implicit exploration of multimodal co-reference by implicitly…

Computation and Language · Computer Science 2021-09-20 Feilong Chen , Fandong Meng , Xiuyi Chen , Peng Li , Jie Zhou

Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring…

Computer Vision and Pattern Recognition · Computer Science 2022-04-07 Zhao Yang , Jiaqi Wang , Yansong Tang , Kai Chen , Hengshuang Zhao , Philip H. S. Torr

Vision transformer based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be…

Computer Vision and Pattern Recognition · Computer Science 2026-04-01 Manyi Yao , Abhishek Aich , Yumin Suh , Amit Roy-Chowdhury , Christian Shelton , Manmohan Chandraker

3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art…

Computer Vision and Pattern Recognition · Computer Science 2026-03-26 Zipeng Wang , Dan Xu

Unlike Object Detection, Visual Grounding task necessitates the detection of an object described by complex free-form language. To simultaneously model such complex semantic and visual representations, recent state-of-the-art studies adopt…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Weitai Kang , Luowei Zhou , Junyi Wu , Changchang Sun , Yan Yan

3D visual grounding aims to localize the unique target described by natural languages in 3D scenes. The significant gap between 3D and language modalities makes it a notable challenge to distinguish multiple similar objects through the…

Computer Vision and Pattern Recognition · Computer Science 2025-08-18 Feng Xiao , Hongbin Xu , Guocan Zhao , Wenxiong Kang

Different from Object Detection, Visual Grounding deals with detecting a bounding box for each text-image pair. This one box for each text-image data provides sparse supervision signals. Although previous works achieve impressive results,…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Weitai Kang , Gaowen Liu , Mubarak Shah , Yan Yan

In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces a sophisticated encoder-decoder framework,…

Computer Vision and Pattern Recognition · Computer Science 2026-01-19 Haicheng Liao , Huanming Shen , Zhenning Li , Chengyue Wang , Guofa Li , Yiming Bie , Chengzhong Xu

Transformers, the de-facto standard for language modeling, have been recently applied for vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save…

Computer Vision and Pattern Recognition · Computer Science 2023-01-11 Lin Song , Songyang Zhang , Songtao Liu , Zeming Li , Xuming He , Hongbin Sun , Jian Sun , Nanning Zheng

Visual grounding aims to align visual information of specific regions of images with corresponding natural language expressions. Current visual grounding methods leverage pre-trained visual and language backbones independently to obtain…

Computer Vision and Pattern Recognition · Computer Science 2024-07-09 Jiaxi Wang , Wenhui Hu , Xueyang Liu , Beihu Wu , Yuting Qiu , YingYing Cai

The 3D visual grounding task aims to ground a natural language description to the targeted object in a 3D scene, which is usually represented in 3D point clouds. Previous works studied visual grounding under specific views. The…

Computer Vision and Pattern Recognition · Computer Science 2022-04-06 Shijia Huang , Yilun Chen , Jiaya Jia , Liwei Wang

Achieving high-quality High Dynamic Range (HDR) imaging on resource-constrained edge devices is a critical challenge in computer vision, as its performance directly impacts downstream tasks such as intelligent surveillance and autonomous…

Computer Vision and Pattern Recognition · Computer Science 2025-09-25 Yu-Shen Huang , Tzu-Han Chen , Cheng-Yen Hsiao , Shaou-Gang Miaou

Multi-scale architecture, including hierarchical vision transformer, has been commonly applied to high-resolution semantic segmentation to deal with computational complexity with minimum performance loss. In this paper, we propose a novel…

Computer Vision and Pattern Recognition · Computer Science 2024-06-17 Jiwon Yoo , Jangwon Lee , Gyeonghwan Kim