Related papers: Position Interpolation Improves ALiBi Extrapolatio…

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has yet to be answered: how does a model achieve extrapolation at inference time for sequences that are longer than it saw during training? We…

Computation and Language · Computer Science · 2022-04-26 · Ofir Press, Noah A. Smith, Mike Lewis

Extending Context Window of Large Language Models via Positional Interpolation

We present Position Interpolation (PI) that extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on…

Computation and Language · Computer Science · 2023-06-29 · Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian

Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective

Enabling LLMs to handle lengthy context is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate the RoPE trained on…

Computation and Language · Computer Science · 2024-12-13 · Meizhi Zhong, Chen Zhang, Yikun Lei, Xikai Liu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang

Wavelet-based Positional Representation for Long Context

In the realm of large-scale language models, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model's position embedding mechanisms are limited to positions encountered…

Computation and Language · Computer Science · 2025-02-05 · Yui Oka, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito

Scaling Laws of RoPE-based Extrapolation

The extrapolation capability of Large Language Models (LLMs) based on Rotary Position Embedding is currently a topic of considerable interest. The mainstream approach to addressing extrapolation with LLMs involves modifying RoPE by…

Computation and Language · Computer Science · 2024-03-14 · Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, Dahua Lin

A Length-Extrapolatable Transformer

Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation. Then…

Computation and Language · Computer Science · 2022-12-21 · Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei

On the token distance modeling ability of higher RoPE attention dimension

Length extrapolation algorithms based on Rotary position embedding (RoPE) have shown promising results in extending the context length of language models. However, understanding how position embedding can capture longer-range contextual…

Computation and Language · Computer Science · 2024-10-22 · Xiangyu Hong, Che Jiang, Biqing Qi, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou

Mesa-Extrapolation: A Weave Position Encoding Method for Enhanced Extrapolation in LLMs

Large language models (LLMs), although having revolutionized many fields, still suffer from the challenging extrapolation problem, where the inference ability of LLMs sharply declines beyond their max training lengths. In this work, we…

Machine Learning · Computer Science · 2024-10-25 · Xin Ma, Yang Liu, Jingjing Liu, Xiaoxu Ma

Model Extrapolation Expedites Alignment

Given the high computational cost of preference alignment training of large language models (LLMs), exploring efficient methods to reduce the training overhead remains an important and compelling research problem. Motivated by the…

Machine Learning · Computer Science · 2025-06-02 · Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, Nanyun Peng

DoPE: Denoising Rotary Position Embedding

Positional encoding is essential for large language models (LLMs) to represent sequence order, yet recent studies show that Rotary Position Embedding (RoPE) can induce massive activation. We investigate the source of these instabilities via…

Computation and Language · Computer Science · 2026-01-07 · Jing Xiong, Liyang Fan, Hui Shen, Zunhai Su, Min Yang, Lingpeng Kong, Ngai Wong

Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis

Length extrapolation permits training a transformer language model on short sequences that preserves perplexities when tested on substantially longer sequences. A relative positional embedding design, ALiBi, has had the widest usage to…

Computation and Language · Computer Science · 2023-05-25 · Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge

Theoretical Analysis of Positional Encodings in Transformer Models: Impact on Expressiveness and Generalization

Positional encodings are a core part of transformer-based models, enabling processing of sequential data without recurrence. This paper presents a theoretical framework to analyze how various positional encoding methods, including…

Machine Learning · Computer Science · 2025-06-10 · Yin Li

Extending LLMs' Context Window with 100 Samples

Large Language Models (LLMs) are known to have limited extrapolation ability beyond their pre-trained context window, constraining their application in downstream tasks with lengthy inputs. Recent studies have sought to extend LLMs' context…

Computation and Language · Computer Science · 2024-01-17 · Yikai Zhang, Junlong Li, Pengfei Liu

ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities

This paper introduces a novel approach to position embeddings in transformer models, named "Exact Positional Embeddings" (ExPE). An absolute positional embedding method that can extrapolate to sequences of lengths longer than the ones it…

Computation and Language · Computer Science · 2025-10-06 · Aleksis Datseris, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva

Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs

Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models (LLMs) by applying rotations to query and key vectors in the complex plane. Standard implementations, however, utilize only the…

Computation and Language · Computer Science · 2025-12-09 · Xiaoran Liu, Yuerong Song, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Zhaoxiang Liu, Shiguo Lian, Ziwei He, Xipeng Qiu

Functional Interpolation for Relative Positions Improves Long Context Transformers

Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture has fundamentally no limits…

Machine Learning · Computer Science · 2024-03-05 · Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, Srinadh Bhojanapalli

A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI)

Transformer-based Large Language Models (LLMs) struggle with inputs exceeding their training context window due to positional out-of-distribution (O.O.D.) issues that disrupt attention. Existing solutions, including fine-tuning and…

Computation and Language · Computer Science · 2025-06-02 · Yan Li, Tianyi Zhang, Zechuan Li, Soyeon Caren Han

Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding

Built upon the Transformer, large language models (LLMs) have captured worldwide attention due to their remarkable abilities. Nevertheless, all Transformer-based models including LLMs suffer from a preset length limit and can hardly…

Computation and Language · Computer Science · 2024-10-08 · Liang Zhao, Xiachong Feng, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin, Ting Liu

Attention Alignment and Flexible Positional Embeddings Improve Transformer Length Extrapolation

An ideal length-extrapolatable Transformer language model can handle sequences longer than the training length without any fine-tuning. Such long-context utilization capability relies heavily on a flexible positional embedding design. Upon…

Computation and Language · Computer Science · 2023-11-16 · Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky

Location Attention for Extrapolation to Longer Sequences

Neural networks are surprisingly good at interpolating and perform remarkably well when the training set examples resemble those in the test set. However, they are often unable to extrapolate patterns beyond the seen data, even when the…

Machine Learning · Computer Science · 2020-04-23 · Yann Dubois, Gautier Dagan, Dieuwke Hupkes, Elia Bruni