Related papers: Exploring Transformer Extrapolation

Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding

Built upon the Transformer, large language models (LLMs) have captured worldwide attention due to their remarkable abilities. Nevertheless, all Transformer-based models including LLMs suffer from a preset length limit and can hardly…

Computation and Language · Computer Science · 2024-10-08 · Liang Zhao, Xiachong Feng, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin, Ting Liu

Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective

Enabling LLMs to handle lengthy context is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate the RoPE trained on…

Computation and Language · Computer Science · 2024-12-13 · Meizhi Zhong, Chen Zhang, Yikun Lei, Xikai Liu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang

Context-aware Biases for Length Extrapolation

Transformers often struggle to generalize to longer sequences than those seen during training, a limitation known as length extrapolation. Most existing Relative Positional Encoding (RPE) methods attempt to address this by introducing…

Computation and Language · Computer Science · 2025-09-23 · Ali Veisi, Hamidreza Amirzadeh, Amir Mansourian

A Length-Extrapolatable Transformer

Position modeling plays a critical role in Transformers. In this paper, we focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We define attention resolution as an indicator of extrapolation. Then…

Computation and Language · Computer Science · 2022-12-21 · Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei

ExPe: Exact Positional Encodings for Generative Transformer Models with Extrapolating Capabilities

This paper introduces a novel approach to position embeddings in transformer models, named "Exact Positional Embeddings" (ExPE). An absolute positional embedding method that can extrapolate to sequences of lengths longer than the ones it…

Computation and Language · Computer Science · 2025-10-06 · Aleksis Datseris, Sylvia Vassileva, Ivan Koychev, Svetla Boytcheva

Transformers Can Achieve Length Generalization But Not Robustly

Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively…

Machine Learning · Computer Science · 2024-02-15 · Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, Denny Zhou

Position Encoding with Random Float Sampling Enhances Length Generalization of Transformers

Length generalization is the ability of language models to maintain performance on inputs longer than those seen during pretraining. In this work, we introduce a simple yet powerful position encoding (PE) strategy, Random Float Sampling…

Machine Learning · Computer Science · 2026-02-17 · Atsushi Shimizu, Shohei Taniguchi, Yutaka Matsuo

Your Transformer May Not be as Powerful as You Expect

Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, theoretical understanding of the RPE-based…

Machine Learning · Computer Science · 2022-10-31 · Shengjie Luo, Shanda Li, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, Di He

A Comparative Study on Positional Encoding for Time-frequency Domain Dual-path Transformer-based Source Separation Models

In this study, we investigate the impact of positional encoding (PE) on source separation performance and the generalization ability to long sequences (length extrapolation) in Transformer-based time-frequency (TF) domain dual-path models.…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-06-03 · Kohei Saijo, Tetsuji Ogawa

Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis

Length extrapolation permits training a transformer language model on short sequences that preserves perplexities when tested on substantially longer sequences. A relative positional embedding design, ALiBi, has had the widest usage to…

Computation and Language · Computer Science · 2023-05-25 · Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge

Functional Interpolation for Relative Positions Improves Long Context Transformers

Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture has fundamentally no limits…

Machine Learning · Computer Science · 2024-03-05 · Shanda Li, Chong You, Guru Guruganesh, Joshua Ainslie, Santiago Ontanon, Manzil Zaheer, Sumit Sanghai, Yiming Yang, Sanjiv Kumar, Srinadh Bhojanapalli

An Exploration of Length Generalization in Transformer-Based Speech Enhancement

The use of Transformer architectures has facilitated remarkable progress in speech enhancement. Training Transformers using substantially long speech utterances is often infeasible as self-attention suffers from quadratic complexity. It is…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-06-18 · Qiquan Zhang, Hongxu Zhu, Xinyuan Qian, Eliathamby Ambikairajah, Haizhou Li

Wavelet-based Positional Representation for Long Context

In the realm of large-scale language models, a significant challenge arises when extrapolating sequences beyond the maximum allowable length. This is because the model's position embedding mechanisms are limited to positions encountered…

Computation and Language · Computer Science · 2025-02-05 · Yui Oka, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito

Scaling Laws of RoPE-based Extrapolation

The extrapolation capability of Large Language Models (LLMs) based on Rotary Position Embedding is currently a topic of considerable interest. The mainstream approach to addressing extrapolation with LLMs involves modifying RoPE by…

Computation and Language · Computer Science · 2024-03-14 · Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, Dahua Lin

DAPE: Data-Adaptive Positional Encoding for Length Extrapolation

Positional encoding plays a crucial role in transformers, significantly impacting model performance and length generalization. Prior research has introduced absolute positional encoding (APE) and relative positional encoding (RPE) to…

Computation and Language · Computer Science · 2024-11-06 · Chuanyang Zheng, Yihang Gao, Han Shi, Minbin Huang, Jingyao Li, Jing Xiong, Xiaozhe Ren, Michael Ng, Xin Jiang, Zhenguo Li, Yu Li

KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation

Relative positional embeddings (RPE) have received considerable attention since RPEs effectively model the relative distance among tokens and enable length extrapolation. We propose KERPLE, a framework that generalizes relative position…

Computation and Language · Computer Science · 2022-10-14 · Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, Alexander I. Rudnicky

CLEX: Continuous Length Extrapolation for Large Language Models

Transformer-based Large Language Models (LLMs) are pioneering advances in many natural language processing tasks, however, their exceptional capabilities are restricted within the preset context window of Transformer. Position Embedding…

Computation and Language · Computer Science · 2024-03-26 · Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing

On the token distance modeling ability of higher RoPE attention dimension

Length extrapolation algorithms based on Rotary position embedding (RoPE) have shown promising results in extending the context length of language models. However, understanding how position embedding can capture longer-range contextual…

Computation and Language · Computer Science · 2024-10-22 · Xiangyu Hong, Che Jiang, Biqing Qi, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou

Extrapolation by Association: Length Generalization Transfer in Transformers

Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this paper, we investigate length…

Computation and Language · Computer Science · 2025-08-05 · Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos

Resonance RoPE: Improving Context Length Generalization of Large Language Models

This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences face difficulty with…

Computation and Language · Computer Science · 2024-09-05 · Suyuchen Wang, Ivan Kobyzev, Peng Lu, Mehdi Rezagholizadeh, Bang Liu