Related papers: MEP: Multiple Kernel Learning Enhancing Relative P…

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We…

Computation and Language · Computer Science 2022-04-26 Ofir Press, Noah A. Smith, Mike Lewis

HyPE: Attention with Hyperbolic Biases for Relative Positional Encoding

In Transformer-based architectures, the attention mechanism is inherently permutation-invariant with respect to the input sequence's tokens. To impose sequential order, token positions are typically encoded using a scheme with either fixed…

Machine Learning · Computer Science 2023-10-31 Giorgio Angelotti

Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate…

Computation and Language · Computer Science 2026-05-11 Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü

Maximum Entropy on Erroneous Predictions (MEEP): Improving model calibration for medical image segmentation

Modern deep neural networks achieved remarkable progress in medical image segmentation tasks. However, it has recently been observed that they tend to produce overconfident estimates, even in situations of high uncertainty, leading to…

Computer Vision and Pattern Recognition · Computer Science 2023-06-05 Agostina Larrazabal, Cesar Martinez, Jose Dolz, Enzo Ferrante

Towards Infinite Length Extrapolation: A Unified Approach

Large language models (LLMs) have revolutionized natural language processing, but their ability to process long sequences is fundamentally limited by the context window size during training. Existing length extrapolation methods often…

Artificial Intelligence · Computer Science 2026-01-13 Nitin Vetcha

Context-aware Biases for Length Extrapolation

Transformers often struggle to generalize to longer sequences than those seen during training, a limitation known as length extrapolation. Most existing Relative Positional Encoding (RPE) methods attempt to address this by introducing…

Computation and Language · Computer Science 2025-09-23 Ali Veisi, Hamidreza Amirzadeh, Amir Mansourian

MEC: Machine-Learning-Assisted Generalized Entropy Calibration for Semi-Supervised Mean Estimation

Obtaining high-quality labels is costly, whereas unlabeled covariates are often abundant, motivating semi-supervised inference methods with reliable uncertainty quantification. Prediction-powered inference (PPI) leverages a machine-learning…

Machine Learning · Statistics 2026-05-29 Se Yoon Lee, Jae Kwang Kim

Self-Supervised Learning via Maximum Entropy Coding

A mainstream type of current self-supervised learning methods pursues a general-purpose representation that can be well transferred to downstream tasks, typically by optimizing on a given pretext task such as instance discrimination. In…

Computer Vision and Pattern Recognition · Computer Science 2022-10-21 Xin Liu, Zhongdao Wang, Yali Li, Shengjin Wang

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

Relative positional embeddings (RPE) have received considerable attention since RPEs effectively model the relative distance among tokens and enable length extrapolation. We propose KERPLE, a framework that generalizes relative position…

Computation and Language · Computer Science 2022-10-14 Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, Alexander I. Rudnicky

Mesa-Extrapolation: A Weave Position Encoding Method for Enhanced Extrapolation in LLMs

Large language models (LLMs), although having revolutionized many fields, still suffer from the challenging extrapolation problem, where the inference ability of LLMs sharply declines beyond their max training lengths. In this work, we…

Machine Learning · Computer Science 2024-10-25 Xin Ma, Yang Liu, Jingjing Liu, Xiaoxu Ma

Wavelet-based Positional Representation for Long Context

In the realm of large-scale language models, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model's position embedding mechanisms are limited to positions encountered…

Computation and Language · Computer Science 2025-02-05 Yui Oka, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito

HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation

Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion: tokens farther away from the current position carry less relevant information. We argue that long-term…

Computation and Language · Computer Science 2024-12-06 Yuhan Chen, Ang Lv, Jian Luan, Bin Wang, Wei Liu

SeqPE: Transformer with Sequential Position Encoding

Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position…

Machine Learning · Computer Science 2025-06-18 Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe

MaPPER: Multimodal Prior-guided Parameter Efficient Tuning for Referring Expression Comprehension

Referring Expression Comprehension (REC), which aims to ground a local visual region via natural language, is a task that heavily relies on multimodal alignment. Most existing methods utilize powerful pre-trained models to transfer…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Ting Liu, Zunnan Xu, Yue Hu, Liangtao Shi, Zhiqiang Wang, Quanjun Yin

Bifocal Attention: Harmonizing Geometric and Spectral Positional Embeddings for Algorithmic Generalization

Rotary Positional Embeddings (RoPE) have become the standard for Large Language Models (LLMs) due to their ability to encode relative positions through geometric rotation. However, we identify a significant limitation we term "Spectral…

Computation and Language · Computer Science 2026-02-02 Kanishk Awadhiya

Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis

Length extrapolation permits training a transformer language model on short sequences that preserves perplexities when tested on substantially longer sequences. A relative positional embedding design, ALiBi, has had the widest usage to…

Computation and Language · Computer Science 2023-05-25 Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge

Theoretical Analysis of Positional Encodings in Transformer Models: Impact on Expressiveness and Generalization

Positional encodings are a core part of transformer-based models, enabling processing of sequential data without recurrence. This paper presents a theoretical framework to analyze how various positional encoding methods, including…

Machine Learning · Computer Science 2025-06-10 Yin Li

Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts

Mixture-of-Experts (MoE) models are typically pre-trained with explicit load-balancing constraints to ensure statistically balanced expert routing. Despite this, we observe that even well-trained MoE models exhibit significantly imbalanced…

Machine Learning · Computer Science 2026-01-27 Xuan-Phi Nguyen, Shrey Pandit, Austin Xu, Caiming Xiong, Shafiq Joty

Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Large Language Models (LLMs) are discovered to suffer from accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly…

Computation and Language · Computer Science 2026-03-16 Xialie Zhuang, Zhikai Jia, Jianjin Li, Zhenyu Zhang, Li Shen, Zheng Cao, Shiwei Liu

Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation

In this work, we leverage the intrinsic segmentation of language sequences and design a new positional encoding method called Bilevel Positional Encoding (BiPE). For each position, our BiPE blends an intra-segment encoding and an…

Machine Learning · Computer Science 2024-06-18 Zhenyu He, Guhao Feng, Shengjie Luo, Kai Yang, Liwei Wang, Jingjing Xu, Zhi Zhang, Hongxia Yang, Di He