Related papers: Lizard: An Efficient Linearization Framework for L…

Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives, from conversational AI to search and AI assistants. However, their growing capabilities come at the cost of extremely large model…

Machine Learning · Computer Science · 2025-02-27 · Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou

Zebra: Extending Context Window with Layerwise Grouped Local-Global Attention

This paper introduces a novel approach to enhance the capabilities of Large Language Models (LLMs) in processing and understanding extensive text sequences, a critical aspect in applications requiring deep comprehension and synthesis of…

Computation and Language · Computer Science · 2023-12-15 · Kaiqiang Song, Xiaoyang Wang, Sangwoo Cho, Xiaoman Pan, Dong Yu

Liger: Linearizing Large Language Models to Gated Recurrent Structures

Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky.…

Computation and Language · Computer Science · 2025-05-08 · Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng

LoLCATs: On Low-Rank Linearizing of Large Language Models

Recent works show we can linearize large language models (LLMs) -- swapping the quadratic attentions of popular Transformer-based LLMs with subquadratic analogs, such as linear attention -- avoiding the expensive pretraining costs. However,…

Machine Learning · Computer Science · 2025-03-07 · Michael Zhang, Simran Arora, Rahul Chalamala, Alan Wu, Benjamin Spector, Aaryan Singhal, Krithik Ramesh, Christopher Ré

ReGLA: Refining Gated Linear Attention

Recent advancements in Large Language Models (LLMs) have set themselves apart with their exceptional performance in complex language modelling tasks. However, these models are also known for their significant computational and storage…

Computation and Language · Computer Science · 2025-08-12 · Peng Lu, Ivan Kobyzev, Mehdi Rezagholizadeh, Boxing Chen, Philippe Langlais

Linearizing Large Language Models

Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state that lowers inference cost. However, their original formulation suffers…

Computation and Language · Computer Science · 2024-05-13 · Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, Thomas Kollar

LatentLLM: Attention-Aware Joint Tensor Compression

Modern foundation models such as large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension…

Machine Learning · Computer Science · 2025-05-27 · Toshiaki Koike-Akino, Xiangyu Chen, Jing Liu, Ye Wang, Pu Wang, Matthew Brand

LT2: Linear-Time Looped Transformers

Looped Transformers (LT) have emerged as a powerful architecture by iterating their layers multiple times before decoding the final token. However, pairing them with full attention retains quadratic complexity, making them computationally…

Machine Learning · Computer Science · 2026-05-26 · Chunyuan Deng, Yizhe Zhang, Rui-Jie Zhu, Yuanyuan Xu, Jiarui Liu, T. S. Eugene Ng, Hanjie Chen

Parallax: Parameterized Local Linear Attention for Language Modeling

Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained structurally unchanged. Local Linear Attention (LLA) is an attention mechanism derived…

Machine Learning · Computer Science · 2026-05-29 · Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf, Shuming Hu, Zhaoran Wang

An Iterative Algorithm for Rescaled Hyperbolic Functions Regression

Large language models (LLMs) have numerous real-life applications across various domains, such as natural language translation, sentiment analysis, language modeling, chatbots and conversational agents, creative writing, text…

Machine Learning · Computer Science · 2025-02-18 · Yeqi Gao, Zhao Song, Junze Yin

LASER: Attention with Exponential Transformation

Transformers have had tremendous impact on several sequence-related tasks, largely due to their ability to retrieve from any part of the sequence via softmax-based dot-product attention. This mechanism plays a crucial role in Transformer's…

Machine Learning · Computer Science · 2025-07-15 · Sai Surya Duvvuri, Inderjit S. Dhillon

The Expressibility of Polynomial based Attention Scheme

Large language models (LLMs) have significantly improved various aspects of our daily lives. These models have impacted numerous domains, from healthcare to education, enhancing productivity, decision-making processes, and accessibility. As…

Machine Learning · Computer Science · 2023-11-01 · Zhao Song, Guangyi Xu, Junze Yin

Large Language Model Partitioning for Low-Latency Inference at the Edge

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-05-06 · Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos

On The Application of Linear Attention in Multimodal Transformers

Multimodal Transformers serve as the backbone for state-of-the-art vision-language models, yet their quadratic attention complexity remains a critical barrier to scalability. In this work, we investigate the viability of Linear Attention…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-14 · Armin Gerami, Seyedehanita Madani, Ramani Duraiswami

Linear Predictability of Attention Heads in Large Language Models

Large language model (LLM) inference is increasingly bottlenecked by the Key-Value (KV) cache, yet the fine-grained structure of attention-head activations remains poorly understood. We show that pretrained Transformers exhibit a pervasive…

Machine Learning · Computer Science · 2026-03-17 · Khalid Shaikh, Asmit Kumar Singh, Rebecca Christopher Dsouza, Shikhar Shiromani

Sliding Window Attention Training for Efficient Large Language Models

Recent advances in transformer-based Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their quadratic computational complexity concerning sequence length remains a significant bottleneck…

Computation and Language · Computer Science · 2025-06-05 · Zichuan Fu, Wentao Song, Yejing Wang, Xian Wu, Yefeng Zheng, Yingying Zhang, Derong Xu, Xuetao Wei, Tong Xu, Xiangyu Zhao

STILL: Selecting Tokens for Intra-Layer Hybrid Attention to Linearize LLMs

Linearizing pretrained large language models (LLMs) primarily relies on intra-layer hybrid attention mechanisms to alleviate the quadratic complexity of standard softmax attention. Existing methods perform token routing based on…

Machine Learning · Computer Science · 2026-02-03 · Weikang Meng, Liangyu Huo, Yadan Luo, Jiawen Guan, Jingyi Zhang, Yingjian Li, Zheng Zhang

TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer

We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear…

Computation and Language · Computer Science · 2024-01-22 · Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Xiao Luo, Yu Qiao, Yiran Zhong

Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization

Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive…

Machine Learning · Computer Science · 2025-05-15 · Xinhao Yao, Hongjin Qian, Xiaolin Hu, Gengze Xu, Wei Liu, Jian Luan, Bin Wang, Yong Liu

Latent-Condensed Transformer for Efficient Long Context Modeling

Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately:…

Computation and Language · Computer Science · 2026-04-17 · Zeng You, Yaofo Chen, Qiuwu Chen, Ying Sun, Shuhai Zhang, Yingjian Li, Yaowei Wang, Mingkui Tan