Related papers: PD-APE: A Parallel Decoding Framework with Adaptiv…

200 papers

3D visual grounding aims to localize the unique target described by natural languages in 3D scenes. The significant gap between 3D and language modalities makes it a notable challenge to distinguish multiple similar objects through the…

Computer Vision and Pattern Recognition · Computer Science 2025-08-18 Feng Xiao, Hongbin Xu, Guocan Zhao, Wenxiong Kang

3D visual grounding aims to find the object within point clouds mentioned by free-form natural language descriptions with rich semantic cues. However, existing methods either extract the sentence-level features coupling all words or focus…

Computer Vision and Pattern Recognition · Computer Science 2023-10-09 Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, Jian Zhang

Vision foundation models have been explored recently to build general-purpose vision systems. However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which…

Computer Vision and Pattern Recognition · Computer Science 2023-12-05 Yunhang Shen, Chaoyou Fu, Peixian Chen, Mengdan Zhang, Ke Li, Xing Sun, Yunsheng Wu, Shaohui Lin, Rongrong Ji

Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. However, existing position encoding…

Computer Vision and Pattern Recognition · Computer Science 2025-05-15 Xi Chen, Shiyang Zhou, Muqi Huang, Jiaxu Feng, Yun Xiong, Kun Zhou, Biao Yang, Yuhui Zhang, Huishuai Bao, Sijia Peng, Chuan Li, Feng Shi

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens, such as from videos, event camera streams, images, or point clouds, our…

In this paper, we address the problem of detecting 3D objects from multi-view images. Current query-based methods rely on global 3D position embeddings (PE) to learn the geometric correspondence between images and 3D space. We claim that…

Computer Vision and Pattern Recognition · Computer Science 2023-03-21 Kaixin Xiong, Shi Gong, Xiaoqing Ye, Xiao Tan, Ji Wan, Errui Ding, Jingdong Wang, Xiang Bai

3D visual grounding is an emerging research area dedicated to making connections between the 3D physical world and natural language, which is crucial for achieving embodied intelligence. In this paper, we propose DASANet, a Dual…

Computer Vision and Pattern Recognition · Computer Science 2024-06-14 Yue Xu, Kaizhi Yang, Jiebo Luo, Xuejin Chen

We introduce a highly performant 3D object detector for point clouds using the DETR framework. The prior attempts all end up with suboptimal results because they fail to learn accurate inductive biases from the limited scale of training…

Computer Vision and Pattern Recognition · Computer Science 2023-08-09 Yichao Shen, Zigang Geng, Yuhui Yuan, Yutong Lin, Ze Liu, Chunyu Wang, Han Hu, Nanning Zheng, Baining Guo

3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description. Typically, the sentences describing the target object tend to provide information about its relative relation between other…

Computer Vision and Pattern Recognition · Computer Science 2023-07-26 Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao

Transformer-based methods have swept the benchmarks on 2D and 3D detection on images. Because tokenization before the attention mechanism drops the spatial information, positional encoding becomes critical for those methods. Recent works…

Computer Vision and Pattern Recognition · Computer Science 2023-07-31 Changyong Shu, Jiajun Deng, Fisher Yu, Yifan Liu

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

3D visual grounding aims to automatically locate the 3D region of the specified object given the corresponding textual description. Existing works fail to distinguish similar objects especially when multiple referred objects are involved in…

Computer Vision and Pattern Recognition · Computer Science 2024-03-14 Feng Xiao, Hongbin Xu, Qiuxia Wu, Wenxiong Kang

3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Wenbin Tan, Jiawen Lin, Fangyong Wang, Yuan Xie, Yong Xie, Yachao Zhang, Yanyun Qu

Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus…

Computer Vision and Pattern Recognition · Computer Science 2025-07-31 Ziyu Zhu, Xilin Wang, Yixuan Li, Zhuofan Zhang, Xiaojian Ma, Yixin Chen, Baoxiong Jia, Wei Liang, Qian Yu, Zhidong Deng, Siyuan Huang, Qing Li

Rotary Positional Encoding (RoPE) is widely used in modern large language models. However, when sequences are extended beyond the range seen during training, rotary phases can enter out-of-distribution regimes, leading to spurious…

Machine Learning · Computer Science 2026-05-12 Riccardo Ali, Alessio Borgi, Christopher Irwin, Mario Severino, Pietro Liò

Visual grounding is a task to locate the target indicated by a natural language expression. Existing methods extend the generic object detection framework to this problem. They base the visual grounding on the features from pre-generated…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu

Inspired by the Bloch Sphere representation, we propose a novel rotary position encoding on a three-dimensional sphere, named 3D Rotary Position Encoding (3D-RPE). 3D-RPE is an advanced version of the widely used 2D Rotary Position Encoding…

Computation and Language · Computer Science 2024-06-17 Xindian Ma, Wenyuan Liu, Peng Zhang, Nan Xu

The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges:…

Computer Vision and Pattern Recognition · Computer Science 2025-05-06 Xiaoqi Li, Jiaming Liu, Nuowei Han, Liang Heng, Yandong Guo, Hao Dong, Yang Liu

Recent progress in 3D scene understanding has explored visual grounding (3DVG) to localize a target object through a language description. However, existing methods only consider the dependency between the entire sentence and the target…

Computer Vision and Pattern Recognition · Computer Science 2023-05-30 Zhihao Yuan, Xu Yan, Zhuo Li, Xuhao Li, Yao Guo, Shuguang Cui, Zhen Li

Sparse query-based paradigms have achieved significant success in multi-view 3D detection for autonomous vehicles. Current research faces challenges in balancing between enlarging receptive fields and reducing interference when aggregating…

Computer Vision and Pattern Recognition · Computer Science 2024-07-25 Jiasen Wang, Zhenglin Li, Ke Sun, Xianyuan Liu, Yang Zhou