Related papers: PD-APE: A Parallel Decoding Framework with Adaptiv…

LSVG: Language-Guided Scene Graphs with 2D-Assisted Multi-Modal Encoding for 3D Visual Grounding

3D visual grounding aims to localize the unique target described by natural languages in 3D scenes. The significant gap between 3D and language modalities makes it a notable challenge to distinguish multiple similar objects through the…

Computer Vision and Pattern Recognition · Computer Science 2025-08-18 Feng Xiao, Hongbin Xu, Guocan Zhao, Wenxiong Kang

EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding

3D visual grounding aims to find the object within point clouds mentioned by free-form natural language descriptions with rich semantic cues. However, existing methods either extract the sentence-level features coupling all words or focus…

Computer Vision and Pattern Recognition · Computer Science 2023-10-09 Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, Jian Zhang

Aligning and Prompting Everything All at Once for Universal Visual Perception

Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which…

Computer Vision and Pattern Recognition · Computer Science 2023-12-05 Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, Rongrong Ji

A 2D Semantic-Aware Position Encoding for Vision Transformers

Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. However, existing position encoding…

Computer Vision and Pattern Recognition · Computer Science 2025-05-15 Xi Chen, Shiyang Zhou, Muqi Huang, Jiaxu Feng, Yun Xiong, Kun Zhou, Biao Yang, Yuhui Zhang, Huishuai Bao, Sijia Peng, Chuan Li, Feng Shi

Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens (such as from videos, event camera streams, images, or point clouds), our…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis

CAPE: Camera View Position Embedding for Multi-View 3D Object Detection

In this paper, we address the problem of detecting 3D objects from multi-view images. Current query-based methods rely on global 3D position embeddings (PE) to learn the geometric correspondence between images and 3D space. We claim that…

Computer Vision and Pattern Recognition · Computer Science 2023-03-21 Kaixin Xiong, Shi Gong, Xiaoqing Ye, Xiao Tan, Ji Wan, Errui Ding, Jingdong Wang, Xiang Bai

Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding

3D visual grounding is an emerging research area dedicated to making connections between the 3D physical world and natural language, which is crucial for achieving embodied intelligence. In this paper, we propose DASANet, a Dual…

Computer Vision and Pattern Recognition · Computer Science 2024-06-14 Yue Xu, Kaizhi Yang, Jiebo Luo, Xuejin Chen

V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

We introduce a highly performant 3D object detector for point clouds using the DETR framework. The prior attempts all end up with suboptimal results because they fail to learn accurate inductive biases from the limited scale of training…

Computer Vision and Pattern Recognition · Computer Science 2023-08-09 Yichao Shen, Zigang Geng, Yuhui Yuan, Yutong Lin, Ze Liu, Chunyu Wang, Han Hu, Nanning Zheng, Baining Guo

3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding

3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description. Typically, the sentences describing the target object tend to provide information about its relative relation between other…

Computer Vision and Pattern Recognition · Computer Science 2023-07-26 Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao

3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers

Transformer-based methods have swept the benchmarks on 2D and 3D detection on images. Because tokenization before the attention mechanism drops the spatial information, positional encoding becomes critical for those methods. Recent works…

Computer Vision and Pattern Recognition · Computer Science 2023-07-31 Changyong Shu, Jiajun Deng, Fisher Yu, Yifan Liu

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention

3D visual grounding aims to automatically locate the 3D region of the specified object given the corresponding textual description. Existing works fail to distinguish similar objects especially when multiple referred objects are involved in…

Computer Vision and Pattern Recognition · Computer Science 2024-03-14 Feng Xiao, Hongbin Xu, Qiuxia Wu, Wenxiong Kang

PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation

3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Wenbin Tan, Jiawen Lin, Fangyong Wang, Yuan Xie, Yong Xie, Yachao Zhang, Yanyun Qu

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation

Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus…

Computer Vision and Pattern Recognition · Computer Science 2025-07-31 Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, Siyuan Huang, Qing Li

Remember to Forget: Gated Adaptive Positional Encoding

Rotary Positional Encoding (RoPE) is widely used in modern large language models. However, when sequences are extended beyond the range seen during training, rotary phases can enter out-of-distribution regimes, leading to spurious…

Machine Learning · Computer Science 2026-05-12 Riccardo Ali, Alessio Borgi, Christopher Irwin, Mario Severino, Pietro Liò

Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning

Visual grounding is a task to locate the target indicated by a natural language expression. Existing methods extend the generic object detection framework to this problem. They base the visual grounding on the features from pre-generated…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu

3D-RPE: Enhancing Long-Context Modeling Through 3D Rotary Position Encoding

Inspired by the Bloch Sphere representation, we propose a novel rotary position encoding on a three-dimensional sphere, named 3D Rotary Position Encoding (3D-RPE). 3D-RPE is an advanced version of the widely used 2D Rotary Position Encoding…

Computation and Language · Computer Science 2024-06-17 Xindian Ma, Wenyuan Liu, Peng Zhang, Nan Xu

3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment

The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges:…

Computer Vision and Pattern Recognition · Computer Science 2025-05-06 Xiaoqi Li, Jiaming Liu, Nuowei Han, Liang Heng, Yandong Guo, Hao Dong, Yang Liu

Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases

Recent progress in 3D scene understanding has explored visual grounding (3DVG) to localize a target object through a language description. However, existing methods only consider the dependency between the entire sentence and the target…

Computer Vision and Pattern Recognition · Computer Science 2023-05-30 Zhihao Yuan, Xu Yan, Zhuo Li, Xuhao Li, Yao Guo, Shuguang Cui, Zhen Li

DVPE: Divided View Position Embedding for Multi-View 3D Object Detection

Sparse query-based paradigms have achieved significant success in multi-view 3D detection for autonomous vehicles. Current research faces challenges in balancing between enlarging receptive fields and reducing interference when aggregating…

Computer Vision and Pattern Recognition · Computer Science 2024-07-25 Jiasen Wang, Zhenglin Li, Ke Sun, Xianyuan Liu, Yang Zhou