Related papers: Weierstrass Positional Encoding for Vision Transfo…

200 papers

Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through…

Computer Vision and Pattern Recognition · Computer Science 2025-08-27 Zhihang Xin, Xitong Hu, Rui Wang

Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. However, existing position encoding…

Computer Vision and Pattern Recognition · Computer Science 2025-05-15 Xi Chen, Shiyang Zhou, Muqi Huang, Jiaxu Feng, Yun Xiong, Kun Zhou, Biao Yang, Yuhui Zhang, Huishuai Bao, Sijia Peng, Chuan Li, Feng Shi

Relative position encoding (RPE) is important for transformer to capture sequence ordering of input tokens. General efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and…

Computer Vision and Pattern Recognition · Computer Science 2021-07-30 Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, Hongyang Chao

Standard Vision Transformers flatten 2D images into 1D sequences, disrupting the natural spatial topology. While Rotary Positional Embedding (RoPE) excels in 1D, it inherits this limitation, often treating spatially distant patches (e.g.,…

Computer Vision and Pattern Recognition · Computer Science 2025-12-05 Yupu Yao, Bowen Yang

Tables are ubiquitous across various domains for concisely representing structured information. Empowering large language models (LLMs) to reason over tabular data represents an actively explored direction. However, since typical LLMs only…

Computation and Language · Computer Science 2024-10-21 Jia-Nan Li, Jian Guan, Wei Wu, Zhengtao Yu, Rui Yan

Rotary Position Embedding (RoPE) is the de facto positional encoding in large language models due to its ability to encode relative positions and support length extrapolation. When adapted to vision transformers, the standard axial…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Haoyu Liu, Sucheng Ren, Tingyu Zhu, Peng Wang, Cihang Xie, Alan Yuille, Zeyu Zheng, Feng Wang

Recent studies have demonstrated the effectiveness of position encoding in transformer architectures. By incorporating positional information, this approach provides essential guidance for modeling dependencies between elements across…

Machine Learning · Computer Science 2025-08-27 Avinash Amballa

Transformer architectures rely on position encodings to model the spatial structure of input data. Rotary Position Encoding (RoPE) is a widely used method in language models that encodes relative positions through fixed, block-diagonal,…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Sophie Ostmeier, Brian Axelrod, Maya Varma, Michael E. Moseley, Akshay Chaudhari, Curtis Langlotz

Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position…

Machine Learning · Computer Science 2025-06-18 Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe

Multimodal time series forecasting is foundational in various fields, such as utilizing satellite imagery and numerical data for predicting typhoons in climate science. However, existing multimodal approaches primarily focus on utilizing…

Machine Learning · Computer Science 2025-06-19 Haobo Li, Eunseo Jung, Zixin Chen, Zhaowei Wang, Yueya Wang, Huamin Qu, Alexis Kai Hon Lau

Relative position embedding has become a standard mechanism for encoding positional information in Transformers. However, existing formulations are typically limited to a fixed geometric space, namely 1D sequences or regular 2D/3D grids,…

Computer Vision and Pattern Recognition · Computer Science 2026-04-22 Yichen Xie, Depu Meng, Chensheng Peng, Yihan Hu, Quentin Herau, Masayoshi Tomizuka, Wei Zhan

We study positional encodings for multi-view transformers that process tokens from a set of posed input images, and seek a mechanism that encodes patches uniquely, allows $SE(3)$-invariant attention with multi-frequency similarity, and can…

Computer Vision and Pattern Recognition · Computer Science 2026-03-23 Yu Wu, Minsik Jeon, Jen-Hao Rick Chang, Oncel Tuzel, Shubham Tulsiani

Diffusion Transformers (DiTs) have emerged as the dominant architecture for visual generation, powering state-of-the-art image and video models. By representing images as patch tokens with positional encodings (PEs), DiTs combine…

Computer Vision and Pattern Recognition · Computer Science 2025-10-24 Yunpeng Bai, Haoxiang Li, Qixing Huang

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens, such as from videos, event camera streams, images, or point clouds, our…

Implicit neural representations (INRs) are increasingly being used as tools to map coordinates to signals, encompassing applications from neural fields to texture compression, shape representations, and beyond. Most INR methods are based on…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Guillaume Perez, Janarbek Matai, Takahiro Harada

Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models' comprehensive perception performance across…

Computer Vision and Pattern Recognition · Computer Science 2025-02-13 Zhanpeng Chen, Mingxiao Li, Ziyang Chen, Nan Du, Xiaolong Li, Yuexian Zou

Transformers are increasingly prevalent for multi-view computer vision tasks, where geometric relationships between viewpoints are critical for 3D perception. To leverage these relationships, multi-view transformers must use camera geometry…

Computer Vision and Pattern Recognition · Computer Science 2025-11-14 Ruilong Li, Brent Yi, Junchen Liu, Hang Gao, Yi Ma, Angjoo Kanazawa

The Position Embedding (PE) is critical for Vision Transformers (VTs) due to the permutation-invariance of self-attention operation. By analyzing the input and output of each encoder layer in VTs using reparameterization and visualization,…

Computer Vision and Pattern Recognition · Computer Science 2022-12-23 Runyi Yu, Zhennan Wang, Yinhuai Wang, Kehan Li, Yian Zhao, Jian Zhang, Guoli Song, Jie Chen

Positional Encodings (PEs) are used to inject word-order information into transformer-based language models. While they can significantly enhance the quality of sentence representations, their specific contribution to language models is not…

Computation and Language · Computer Science 2023-10-20 Lihu Chen, Gaël Varoquaux, Fabian M. Suchanek

Positional embeddings (PE) play a crucial role in Vision Transformers (ViTs) by providing spatial information otherwise lost due to the permutation invariant nature of self attention. While absolute positional embeddings (APE) have shown…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Md Abtahi Majeed Chowdhury, Md Rifat Ur Rahman, Akil Ahmad Taki