Related papers: Efficient Sparse Attention needs Adaptive Token Re…

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which…

Machine Learning · Computer Science 2025-11-05 Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao

Anchor Attention, Small Cache: Code Generation with Large Language Models

The development of large language models (LLMs) has revolutionized automated code generation. However, their high demand of computation resources has hindered a broader deployment and raised environmental concerns. A common strategy for…

Software Engineering · Computer Science 2024-11-12 Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen

Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs

Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding. Prior methods enforce fixed token budgets for sparse attention, assuming a set number of tokens can approximate full…

Machine Learning · Computer Science 2025-02-19 Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at…

Computation and Language · Computer Science 2026-05-04 Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim

Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within…

Computation and Language · Computer Science 2026-01-29 Zecheng Tang, Quantong Qiu, Yi Yang, Zhiyi Hong, Haiya Xiang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang

Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate…

Computation and Language · Computer Science 2025-11-06 Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer…

Machine Learning · Computer Science 2024-10-08 Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia

Training-free Context-adaptive Attention for Efficient Long Context Modeling

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of natural language processing tasks. These capabilities stem primarily from the self-attention mechanism, which enables modeling of long-range…

Computation and Language · Computer Science 2026-01-05 Zeng You, Yaofo Chen, Shuhai Zhang, Zhijie Qiu, Tingyu Wu, Yingjian Li, Yaowei Wang, Mingkui Tan

PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention

Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer…

Computation and Language · Computer Science 2025-03-06 Lida Chen, Dong Xu, Chenxin An, Xintao Wang, Yikai Zhang, Jiangjie Chen, Zujie Liang, Feng Wei, Jiaqing Liang, Yanghua Xiao, Wei Wang

Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction

Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To…

Computation and Language · Computer Science 2025-10-27 Mutian He, Philip N. Garner

AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models

Large Language Models (LLMs) have significantly advanced the field of Artificial Intelligence. However, their deployment is resource-intensive, not only due to the large number of model parameters but also because the (Key-Value) KV cache…

Computation and Language · Computer Science 2025-06-05 Yifeng Gu, Zicong Jiang, Jianxiu Jin, Kailing Guo, Ziyang Zhang, Xiangmin Xu

Loki: Low-rank Keys for Efficient Sparse Attention

Inference on large language models (LLMs) can be expensive in terms of the compute and memory costs involved, especially when long sequence lengths are used. In particular, the self-attention mechanism used in LLM inference contributes…

Machine Learning · Computer Science 2024-11-11 Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity…

Computation and Language · Computer Science 2025-09-04 Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, Chao Yang

SparQ Attention: Bandwidth-Efficient LLM Inference

The computational difficulties of large language model (LLM) inference remain a significant obstacle to their widespread deployment. The need for many applications to support long input sequences and process them in large batches typically…

Machine Learning · Computer Science 2024-09-05 Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr

Efficient Attention Mechanisms for Large Language Models: A Survey

Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address…

Computation and Language · Computer Science 2026-02-10 Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with the…

Computation and Language · Computer Science 2026-01-28 Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti

Multipole Attention for Efficient Long Context Reasoning

Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long…

Computation and Language · Computer Science 2025-12-16 Coleman Hooper, Sebastian Zhao, Luca Manolache, Sehoon Kim, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami

DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration

Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined…

Computation and Language · Computer Science 2025-06-16 Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng

Large Language Model Partitioning for Low-Latency Inference at the Edge

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos

UNIQUE: Universal Top-k Sparse Attention for Training-free Inference and Sparsity-aware Training

Long-context inference in large language models (LLMs) is bottlenecked by the linear growth of the self-attention key-value (KV) cache. Top-k sparse attention alleviates this by loading only a small fraction of the KV cache, but accurately…

Computation and Language · Computer Science 2026-05-28 Keqi Deng, Shaoshi Ling, Ruchao Fan, Jinyu Li