Related papers: Position Embedding Needs an Independent Layer Norm…

LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers

Positional embeddings (PE) play a crucial role in Vision Transformers (ViTs) by providing spatial information otherwise lost due to the permutation invariant nature of self attention. While absolute positional embeddings (APE) have shown…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-14 · Md Abtahi Majeed Chowdhury, Md Rifat Ur Rahman, Akil Ahmad Taki

Maximizing the Position Embedding for Vision Transformers with Global Average Pooling

In vision transformers, position embedding (PE) plays a crucial role in capturing the order of tokens. However, in vision transformer structures, there is a limitation in the expressiveness of PE due to the structure where position…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-06 · Wonjun Lee, Bumsub Ham, Suhyun Kim

MSPE: Multi-Scale Patch Embedding Prompts Vision Transformers to Any Resolution

Although Vision Transformers (ViTs) have recently advanced computer vision tasks significantly, an important real-world problem was overlooked: adapting to variable input resolutions. Typically, images are resized to a fixed resolution,…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-29 · Wenzhuo Liu, Fei Zhu, Shijie Ma, Cheng-Lin Liu

AdPE: Adversarial Positional Embeddings for Pretraining Vision Transformers via MAE+

Unsupervised learning of vision transformers seeks to pretrain an encoder via pretext tasks without labels. Among them is the Masked Image Modeling (MIM) aligned with pretraining of language transformers by predicting masked patches as a…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-15 · Xiao Wang, Ying Wang, Ziwei Xuan, Guo-Jun Qi

Learning to Adapt to Position Bias in Vision Transformer Classifiers

How discriminative position information is for image classification depends on the data. On the one hand, the camera position is arbitrary and objects can appear anywhere in the image, arguing for translation invariance. At the same time,…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-20 · Robert-Jan Bruintjes, Jan van Gemert

Rethinking and Improving Relative Position Encoding for Vision Transformer

Relative position encoding (RPE) is important for transformer to capture sequence ordering of input tokens. General efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and…

Computer Vision and Pattern Recognition · Computer Science · 2021-07-30 · Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, Hongyang Chao

Perception Encoder: The best visual embeddings are not at the output of the network

We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-30 · Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, Christoph Feichtenhofer

CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

Without positional information, attention-based Transformer neural networks are permutation-invariant. Absolute or relative positional embeddings are the most popular ways to feed Transformer models with positional information. Absolute…

Machine Learning · Computer Science · 2021-11-10 · Tatiana Likhomanenko, Qiantong Xu, Gabriel Synnaeve, Ronan Collobert, Alex Rogozhnikov

Rotary Position Embedding for Vision Transformer

Rotary Position Embedding (RoPE) performs remarkably on language models, especially for length extrapolation of Transformers. However, the impacts of RoPE on computer vision domains have been underexplored, even though RoPE appears capable…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-17 · Byeongho Heo, Song Park, Dongyoon Han, Sangdoo Yun

Conditional Positional Encodings for Vision Transformers

We propose a conditional positional encoding (CPE) scheme for vision Transformers. Unlike previous fixed or learnable positional encodings, which are pre-defined and independent of input tokens, CPE is dynamically generated and conditioned…

Computer Vision and Pattern Recognition · Computer Science · 2023-02-14 · Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Chunhua Shen

Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training

The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of many different visual tasks. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-23 · Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Lorenzo Baraldi, Andrea Pilzer, Rita Cucchiara

Efficient Adaptation of Pre-trained Vision Transformer via Householder Transformation

A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-31 · Wei Dong, Yuan Sun, Yiting Yang, Xing Zhang, Zhijun Lin, Qingsen Yan, Haokui Zhang, Peng Wang, Yang Yang, Hengtao Shen

Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding

Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce…

Computation and Language · Computer Science · 2025-08-22 · Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang

Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers

Position Embeddings (PEs), an arguably indispensable component in Vision Transformers (ViTs), have been shown to improve the performance of ViTs on many vision tasks. However, PEs have a potentially high risk of privacy leakage since the…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-29 · Bin Ren, Yahui Liu, Yue Song, Wei Bi, Rita Cucchiara, Nicu Sebe, Wei Wang

Beyond flattening: a geometrically principled positional encoding for vision transformers with Weierstrass elliptic functions

Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-27 · Zhihang Xin, Xitong Hu, Rui Wang

SeqPE: Transformer with Sequential Position Encoding

Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position…

Machine Learning · Computer Science · 2025-06-18 · Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe

A 2D Semantic-Aware Position Encoding for Vision Transformers

Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. However, existing position encoding…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-15 · Xi Chen, Shiyang Zhou, Muqi Huang, Jiaxu Feng, Yun Xiong, Kun Zhou, Biao Yang, Yuhui Zhang, Huishuai Bao, Sijia Peng, Chuan Li, Feng Shi

The Impact of Positional Encoding on Length Generalization in Transformers

Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major…

Computation and Language · Computer Science · 2023-11-08 · Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, Siva Reddy

Learning interpretable positional encodings in transformers depends on initialization

In transformers, the positional encoding (PE) provides essential information that distinguishes the position and order amongst tokens in a sequence. Most prior investigations of PE effects on generalization were tailored to 1D input…

Machine Learning · Computer Science · 2025-06-24 · Takuya Ito, Luca Cocchi, Tim Klinger, Parikshit Ram, Murray Campbell, Luke Hearne

Sequence and Circle: Exploring the Relationship Between Patches

The vision transformer (ViT) has achieved state-of-the-art results in various vision tasks. It utilizes a learnable position embedding (PE) mechanism to encode the location of each image patch. However, it is presently unclear if this…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-20 · Zhengyang Yu, Jochen Triesch