Related papers: Universal Length Generalization with Turing Progra…

Exploring Length Generalization in Large Language Models

The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks, and is crucial when learning from datasets where longer problem instances are rare. These…

Computation and Language · Computer Science 2022-11-15 Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur

Transformers Can Achieve Length Generalization But Not Robustly

Length generalization, defined as the ability to extrapolate from shorter training sequences to longer test ones, is a significant challenge for language models. This issue persists even with large-scale Transformers handling relatively…

Machine Learning · Computer Science 2024-02-15 Yongchao Zhou, Uri Alon, Xinyun Chen, Xuezhi Wang, Rishabh Agarwal, Denny Zhou

Extrapolation by Association: Length Generalization Transfer in Transformers

Transformer language models have demonstrated impressive generalization capabilities in natural language domains, yet we lack a fine-grained understanding of how such generalization arises. In this paper, we investigate length…

Computation and Language · Computer Science 2025-08-05 Ziyang Cai, Nayoung Lee, Avi Schwarzschild, Samet Oymak, Dimitris Papailiopoulos

Barriers to Universal Reasoning With Transformers (And How to Overcome Them)

Chain-of-Thought (CoT) has been shown to empirically improve Transformers' performance, and theoretically increase their expressivity to Turing completeness. However, whether Transformers can learn to generalize to CoT traces longer than…

Machine Learning · Computer Science 2026-04-29 Oliver Kraus, Yash Sarrof, Yuekun Yao, Alexander Koller, Michael Hahn

A Formal Framework for Understanding Length Generalization in Transformers

A major challenge for transformers is generalizing to sequences longer than those observed during training. While previous works have empirically shown that transformers can either succeed or fail at length generalization depending on the…

Machine Learning · Computer Science 2025-05-01 Xinting Huang, Andy Yang, Satwik Bhattamishra, Yash Sarrof, Andreas Krebs, Hattie Zhou, Preetum Nakkiran, Michael Hahn

What Algorithms can Transformers Learn? A Study in Length Generalization

Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of if and when Transformer models can learn the true…

Machine Learning · Computer Science 2023-10-25 Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, Preetum Nakkiran

Randomized Positional Encodings Boost Length Generalization of Transformers

Transformers have impressive generalization capabilities on tasks with a fixed context length. However, they fail to generalize to sequences of arbitrary length, even for seemingly simple tasks such as duplicating a string. Moreover, simply…

Machine Learning · Computer Science 2023-05-29 Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness

Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count

Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. While arithmetic tasks are commonly used to study length generalization, certain tasks are…

Machine Learning · Computer Science 2025-04-18 Hanseul Cho, Jaeyoung Cha, Srinadh Bhojanapalli, Chulhee Yun

Principled Understanding of Generalization for Generative Transformer Models in Arithmetic Reasoning Tasks

Transformer-based models excel in various tasks but their generalization capabilities, especially in arithmetic reasoning, remain incompletely understood. Arithmetic tasks provide a controlled framework to explore these capabilities, yet…

Machine Learning · Computer Science 2025-08-07 Xingcheng Xu, Zibo Zhao, Haipeng Zhang, Yanqing Yang

Improving Length-Generalization in Transformers via Task Hinting

It has been observed in recent years that transformers have problems with length generalization for certain types of reasoning and arithmetic tasks. In particular, the performance of a transformer model trained on tasks (say addition) up to…

Machine Learning · Computer Science 2023-10-03 Pranjal Awasthi, Anupam Gupta

Looped Transformers for Length Generalization

Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same…

Machine Learning · Computer Science 2025-05-13 Ying Fan, Yilun Du, Kannan Ramchandran, Kangwook Lee

The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner

Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge of Transformer-based large language models (LLM). Although existing studies have predominantly focused on…

Computation and Language · Computer Science 2026-01-29 Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Dahua Lin, Kai Chen

Non-Asymptotic Length Generalization

Length generalization is the ability of a learning algorithm to learn a hypothesis which generalizes to longer inputs than the inputs in the training set. In this paper, we provide provable guarantees of length generalization for various…

Machine Learning · Computer Science 2025-06-09 Thomas Chen, Tengyu Ma, Zhiyuan Li

Quantitative Bounds for Length Generalization in Transformers

We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2025)…

Machine Learning · Computer Science 2025-11-03 Zachary Izzo, Eshaan Nichani, Jason D. Lee

Auto-Regressive Next-Token Predictors are Universal Learners

Large language models display remarkable capabilities in logical and mathematical reasoning, allowing them to solve complex tasks. Interestingly, these abilities emerge in networks trained on the simple task of next-token prediction. In…

Machine Learning · Computer Science 2024-07-31 Eran Malach

Length Generalization Bounds for Transformers

Length generalization is a key property of a learning algorithm that enables it to make correct predictions on inputs of any length, given finite training data. To provide such a guarantee, one needs to be able to compute a length…

Machine Learning · Computer Science 2026-03-04 Andy Yang, Pascal Bergsträßer, Georg Zetzsche, David Chiang, Anthony W. Lin

On the Generalizability of Transformer Models to Code Completions of Different Lengths

The programming landscape is nowadays being reshaped by the advent of Large Language Models (LLMs) able to automate code-related tasks related to code implementation (e.g., code completion) and comprehension (e.g., code summarization). Such…

Software Engineering · Computer Science 2025-01-10 Nathan Cooper, Rosalia Tufano, Gabriele Bavota, Denys Poshyvanyk

Post-Norm can Resharpen Attention

Length Generalization is the essential capacity of autonomous agents to perform tasks in longer contexts than those encountered during training. To systematically study this feat, we test how well models can approximate the next token…

Machine Learning · Computer Science 2026-02-02 Pál Zsámboki, Benjamin Levi, David Ansel Josef Smith, Mitansh Kagalwala, Arlington Kell, Samuel Liechty, Cong Wang

Beyond In-Distribution Success: Scaling Curves of CoT Granularity for Language Model Generalization

Generalization to novel compound tasks under distribution shift is important for deploying transformer-based language models (LMs). This work investigates Chain-of-Thought (CoT) reasoning as a means to enhance OOD generalization. Through…

Computation and Language · Computer Science 2026-03-31 Ru Wang, Wei Huang, Selena Song, Haoyu Zhang, Qian Niu, Yusuke Iwasawa, Yutaka Matsuo, Jiaxian Guo

Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization

The ability to reason lies at the core of artificial intelligence (AI), and challenging problems usually call for deeper and longer reasoning to tackle. A crucial question about AI reasoning is whether models can extrapolate learned…

Machine Learning · Computer Science 2025-11-11 Yu Huang, Zixin Wen, Aarti Singh, Yuejie Chi, Yuxin Chen