Related papers: Linearizing Large Language Models

Liger: Linearizing Large Language Models to Gated Recurrent Structures

Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky.…

Computation and Language · Computer Science 2025-05-08 Disen Lan , Weigao Sun , Jiaxi Hu , Jusen Du , Yu Cheng

Mamba-PTQ: Outlier Channels in Recurrent Large Language Models

Modern recurrent layers are emerging as a promising path toward edge deployment of foundation models, especially in the context of large language models (LLMs). Compressing the whole input sequence in a finite-dimensional representation…

Machine Learning · Computer Science 2024-07-18 Alessandro Pierro , Steven Abreu

Lizard: An Efficient Linearization Framework for Large Language Models

We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into subquadratic architectures. Transformers face severe computational and memory bottlenecks with long sequences due…

Computation and Language · Computer Science 2026-04-21 Chien Van Nguyen , Huy Nguyen , Ruiyi Zhang , Hanieh Deilamsalehy , Puneet Mathur , Viet Dac Lai , Haoliang Wang , Jayakumar Subramanian , Ryan A. Rossi , Trung Bui , Nikos Vlassis , Franck Dernoncourt , Thien Huu Nguyen

Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives, from conversational AI to search and AI assistants. However, their growing capabilities come at the cost of extremely large model…

Machine Learning · Computer Science 2025-02-27 Yingyu Liang , Jiangxuan Long , Zhenmei Shi , Zhao Song , Yufa Zhou

RecurFormer: Not All Transformer Heads Need Self-Attention

Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference, especially with long inputs due to the attention mechanism's memory overhead. We observe…

Computation and Language · Computer Science 2024-10-18 Ruiqing Yan , Linghan Zheng , Xingbo Du , Han Zou , Yufeng Guo , Jianfei Yang

How Effective are State Space Models for Machine Translation?

Transformers are the current architecture of choice for NLP, but their attention layers do not scale well to long contexts. Recent works propose to replace attention with linear recurrent layers -- this is the case for state space models,…

Computation and Language · Computer Science 2024-07-09 Hugo Pitorro , Pavlo Vasylenko , Marcos Treviso , André F. T. Martins

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as a viable linear-time alternative to transformers with softmax attention. However, these models still underperform…

Machine Learning · Computer Science 2025-01-16 Songlin Yang , Bailin Wang , Yu Zhang , Yikang Shen , Yoon Kim

Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models

Transformer-based embedding models suffer from quadratic computational and linear memory complexity, limiting their utility for long sequences. We propose recurrent architectures as an efficient alternative, introducing a vertically chunked…

Computation and Language · Computer Science 2026-04-21 Tobias Grantner , Emanuel Sallinger , Martin Flechl

MesaNet: Sequence Modeling by Locally Optimal Test-Time Training

Sequence modeling is currently dominated by causal transformer architectures that use softmax self-attention. Although widely adopted, transformers require scaling memory and compute linearly during inference. A recent stream of work…

Machine Learning · Computer Science 2025-06-06 Johannes von Oswald , Nino Scherrer , Seijin Kobayashi , Luca Versari , Songlin Yang , Maximilian Schlegel , Kaitlin Maile , Yanick Schimpf , Oliver Sieberling , Alexander Meulemans , Rif A. Saurous , Guillaume Lajoie , Charlotte Frenkel , Razvan Pascanu , Blaise Agüera y Arcas , João Sacramento

Compressing Large Language Models with Automated Sub-Network Search

Large Language Models (LLMs) demonstrate exceptional reasoning abilities, enabling strong generalization across diverse tasks such as commonsense reasoning and instruction following. However, as LLMs scale, inference costs become…

Computation and Language · Computer Science 2025-02-06 Rhea Sanjay Sukthanker , Benedikt Staffler , Frank Hutter , Aaron Klein

Simple and Scalable Strategies to Continually Pre-train Large Language Models

Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available. A much more efficient solution is to continually pre-train these models, saving significant…

Machine Learning · Computer Science 2024-09-05 Adam Ibrahim , Benjamin Thérien , Kshitij Gupta , Mats L. Richter , Quentin Anthony , Timothée Lesort , Eugene Belilovsky , Irina Rish

RWKV: Reinventing RNNs for the Transformer Era

Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit…

Computation and Language · Computer Science 2023-12-12 Bo Peng , Eric Alcaide , Quentin Anthony , Alon Albalak , Samuel Arcadinho , Stella Biderman , Huanqi Cao , Xin Cheng , Michael Chung , Matteo Grella , Kranthi Kiran GV , Xuzheng He , Haowen Hou , Jiaju Lin , Przemyslaw Kazienko , Jan Kocon , Jiaming Kong , Bartlomiej Koptyra , Hayden Lau , Krishna Sri Ipsit Mantri , Ferdinand Mom , Atsushi Saito , Guangyu Song , Xiangru Tang , Bolun Wang , Johan S. Wind , Stanislaw Wozniak , Ruichong Zhang , Zhenyuan Zhang , Qihang Zhao , Peng Zhou , Qinghua Zhou , Jian Zhu , Rui-Jie Zhu

Pay Attention to What You Need

Although large language models (LLMs) have achieved significant success in natural language processing, they still struggle with long-context comprehension. Traditional approaches to mitigating this issue typically rely on fine-tuning or…

Computation and Language · Computer Science 2025-02-25 Yifei Gao , Shaohong Chen , Lei Wang , Ruiting Dai , Ziyun Zhang , Kerui Ren , Jiaji Wu , Jun Cheng

Finetuning Pretrained Transformers into RNNs

Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. But this comes with a significant computational cost, as the attention mechanism's complexity scales quadratically with sequence length.…

Computation and Language · Computer Science 2021-09-21 Jungo Kasai , Hao Peng , Yizhe Zhang , Dani Yogatama , Gabriel Ilharco , Nikolaos Pappas , Yi Mao , Weizhu Chen , Noah A. Smith

What Matters in Linearizing Language Models? A Comparative Study of Architecture, Scale, and Task Adaptation

Linearization has emerged as a strategy for developing efficient language models (LMs). Starting from an existing Transformer-based LM, linearization replaces the attention component with computationally efficient subquadratic token…

Computation and Language · Computer Science 2026-02-02 Patrick Haller , Jonas Golde , Alan Akbik

Advancing Regular Language Reasoning in Linear Recurrent Neural Networks

In recent studies, linear recurrent neural networks (LRNNs) have achieved Transformer-level performance in natural language and long-range modeling, while offering rapid parallel training and constant inference cost. With the resurgence of…

Computation and Language · Computer Science 2024-04-10 Ting-Han Fan , Ta-Chung Chi , Alexander I. Rudnicky

Linearizing Vision Transformer with Test-Time Training

While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Yining Li , Dongchen Han , Zeyu Liu , Hanyi Wang , Yulin Wang , Gao Huang

Reversing Large Language Models for Efficient Training and Fine-Tuning

Large Language Models (LLMs) are known for their expensive and time-consuming training. Thus, oftentimes, LLMs are fine-tuned to address a specific task, given the pretrained weights of a pre-trained LLM considered a foundation model. In…

Computation and Language · Computer Science 2025-12-05 Eshed Gal , Moshe Eliasof , Javier Turek , Uri Ascher , Eran Treister , Eldad Haber

Revisiting associative recall in modern recurrent models

Despite the advantageous subquadratic complexity of modern recurrent deep learning models -- such as state-space models (SSMs) -- recent studies have highlighted their potential shortcomings compared to transformers on reasoning and…

Machine Learning · Computer Science 2025-10-13 Destiny Okpekpe , Antonio Orvieto

LoLCATs: On Low-Rank Linearizing of Large Language Models

Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However,…

Machine Learning · Computer Science 2025-03-07 Michael Zhang , Simran Arora , Rahul Chalamala , Alan Wu , Benjamin Spector , Aaryan Singhal , Krithik Ramesh , Christopher Ré