English
Related papers

Related papers: Position Embedding Needs an Independent Layer Norm…

200 papers

Positional embeddings (PE) play a crucial role in Vision Transformers (ViTs) by providing spatial information otherwise lost due to the permutation invariant nature of self attention. While absolute positional embeddings (APE) have shown…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Md Abtahi Majeed Chowdhury , Md Rifat Ur Rahman , Akil Ahmad Taki

In vision transformers, position embedding (PE) plays a crucial role in capturing the order of tokens. However, in vision transformer structures, there is a limitation in the expressiveness of PE due to the structure where position…

Computer Vision and Pattern Recognition · Computer Science 2025-02-06 Wonjun Lee , Bumsub Ham , Suhyun Kim

Although Vision Transformers (ViTs) have recently advanced computer vision tasks significantly, an important real-world problem was overlooked: adapting to variable input resolutions. Typically, images are resized to a fixed resolution,…

Computer Vision and Pattern Recognition · Computer Science 2024-05-29 Wenzhuo Liu , Fei Zhu , Shijie Ma , Cheng-Lin Liu

Unsupervised learning of vision transformers seeks to pretrain an encoder via pretext tasks without labels. Among them is the Masked Image Modeling (MIM) aligned with pretraining of language transformers by predicting masked patches as a…

Computer Vision and Pattern Recognition · Computer Science 2023-03-15 Xiao Wang , Ying Wang , Ziwei Xuan , Guo-Jun Qi

How discriminative position information is for image classification depends on the data. On the one hand, the camera position is arbitrary and objects can appear anywhere in the image, arguing for translation invariance. At the same time,…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Robert-Jan Bruintjes , Jan van Gemert

Relative position encoding (RPE) is important for transformer to capture sequence ordering of input tokens. General efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and…

Computer Vision and Pattern Recognition · Computer Science 2021-07-30 Kan Wu , Houwen Peng , Minghao Chen , Jianlong Fu , Hongyang Chao

We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each…

Without positional information, attention-based Transformer neural networks are permutation-invariant. Absolute or relative positional embeddings are the most popular ways to feed Transformer models with positional information. Absolute…

Machine Learning · Computer Science 2021-11-10 Tatiana Likhomanenko , Qiantong Xu , Gabriel Synnaeve , Ronan Collobert , Alex Rogozhnikov

Rotary Position Embedding (RoPE) performs remarkably on language models, especially for length extrapolation of Transformers. However, the impacts of RoPE on computer vision domains have been underexplored, even though RoPE appears capable…

Computer Vision and Pattern Recognition · Computer Science 2024-07-17 Byeongho Heo , Song Park , Dongyoon Han , Sangdoo Yun

We propose a conditional positional encoding (CPE) scheme for vision Transformers. Unlike previous fixed or learnable positional encodings, which are pre-defined and independent of input tokens, CPE is dynamically generated and conditioned…

Computer Vision and Pattern Recognition · Computer Science 2023-02-14 Xiangxiang Chu , Zhi Tian , Bo Zhang , Xinlong Wang , Chunhua Shen

The use of self-supervised pre-training has emerged as a promising approach to enhance the performance of many different visual tasks. In this context, recent approaches have employed the Masked Image Modeling paradigm, which pre-trains a…

Computer Vision and Pattern Recognition · Computer Science 2025-01-23 Lorenzo Baraldi , Roberto Amoroso , Marcella Cornia , Lorenzo Baraldi , Andrea Pilzer , Rita Cucchiara

A common strategy for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViTs) involves adapting the model to downstream tasks by learning a low-rank adaptation matrix. This matrix is decomposed into a product of…

Computer Vision and Pattern Recognition · Computer Science 2024-10-31 Wei Dong , Yuan Sun , Yiting Yang , Xing Zhang , Zhijun Lin , Qingsen Yan , Haokui Zhang , Peng Wang , Yang Yang , Hengtao Shen

Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce…

Computation and Language · Computer Science 2025-08-22 Jiajun Zhu , Peihao Wang , Ruisi Cai , Jason D. Lee , Pan Li , Zhangyang Wang

Position Embeddings (PEs), an arguably indispensable component in Vision Transformers (ViTs), have been shown to improve the performance of ViTs on many vision tasks. However, PEs have a potentially high risk of privacy leakage since the…

Computer Vision and Pattern Recognition · Computer Science 2023-05-29 Bin Ren , Yahui Liu , Yue Song , Wei Bi , Rita Cucchiara , Nicu Sebe , Wei Wang

Vision Transformers have demonstrated remarkable success in computer vision tasks, yet their reliance on learnable one-dimensional positional embeddings fundamentally disrupts the inherent two-dimensional spatial structure of images through…

Computer Vision and Pattern Recognition · Computer Science 2025-08-27 Zhihang Xin , Xitong Hu , Rui Wang

Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position…

Machine Learning · Computer Science 2025-06-18 Huayang Li , Yahui Liu , Hongyu Sun , Deng Cai , Leyang Cui , Wei Bi , Peilin Zhao , Taro Watanabe

Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. However, existing position encoding…

Computer Vision and Pattern Recognition · Computer Science 2025-05-15 Xi Chen , Shiyang Zhou , Muqi Huang , Jiaxu Feng , Yun Xiong , Kun Zhou , Biao Yang , Yuhui Zhang , Huishuai Bao , Sijia Peng , Chuan Li , Feng Shi

Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major…

Computation and Language · Computer Science 2023-11-08 Amirhossein Kazemnejad , Inkit Padhi , Karthikeyan Natesan Ramamurthy , Payel Das , Siva Reddy

In transformers, the positional encoding (PE) provides essential information that distinguishes the position and order amongst tokens in a sequence. Most prior investigations of PE effects on generalization were tailored to 1D input…

Machine Learning · Computer Science 2025-06-24 Takuya Ito , Luca Cocchi , Tim Klinger , Parikshit Ram , Murray Campbell , Luke Hearne

The vision transformer (ViT) has achieved state-of-the-art results in various vision tasks. It utilizes a learnable position embedding (PE) mechanism to encode the location of each image patch. However, it is presently unclear if this…

Computer Vision and Pattern Recognition · Computer Science 2022-10-20 Zhengyang Yu , Jochen Triesch
‹ Prev 1 2 3 10 Next ›