Related papers: SparseAccelerate: Efficient Long-Context Inference…

FAST-Prefill: FPGA Accelerated Sparse Attention for Long Context LLM Prefill

In long-context large language model (LLM) inference, the prefill stage dominates computation due to self-attention over the complete input context. Sparse attention significantly reduces self-attention computation by limiting each token's…

Hardware Architecture · Computer Science 2026-02-25 Rakshith Jayanth, Viktor Prasanna

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at…

Computation and Language · Computer Science 2026-05-04 Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention…

Computation and Language · Computer Science 2024-10-31 Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu

Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA)…

Machine Learning · Computer Science 2026-04-10 Quantong Qiu, Zhiyi Hong, Yi Yang, Haitian Wang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang

Long-Context Modeling with Dynamic Hierarchical Sparse Attention for Memory-Constrained LLM Inference

The quadratic cost of attention limits the scalability of long-context LLMs, especially under limited hardware memory budgets. While attention is often sparse, existing static sparse methods cannot adapt to task- or input-dependent…

Computation and Language · Computer Science 2026-05-29 Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho

AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity

Large Language Models (LLMs) with extended context lengths face significant computational challenges during the pre-filling phase, primarily due to the quadratic complexity of self-attention. Existing methods typically employ dynamic…

Machine Learning · Computer Science 2025-05-30 Yu Zhang, Dong Guo, Fang Wu, Guoliang Zhu, Dian Ding, Yiming Zhang

PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention

Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer…

Computation and Language · Computer Science 2025-03-06 Lida Chen, Dong Xu, Chenxin An, Xintao Wang, Yikai Zhang, Jiangjie Chen, Zujie Liang, Feng Wei, Jiaqing Liang, Yanghua Xiao, Wei Wang

Accelerating Prefilling for Long-Context LLMs via Sparse Pattern Sharing

Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. While existing sparse attention methods rely…

Machine Learning · Computer Science 2025-05-27 Dan Peng, Zhihui Fu, Zewen Ye, Zhuoran Song, Jun Wang

Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference

Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent…

Computation and Language · Computer Science 2025-10-22 Siyuan Yan, Guo-Qing Jiang, Yuchen Zhang, Xiaoxing Ma, Ran Zhu, Chun Cao, Jingwei Xu

Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques

Transformers have demonstrated great success in numerous domains including natural language processing and bioinformatics. This success stems from the use of the attention mechanism by these models in order to represent and propagate…

Machine Learning · Computer Science 2025-02-10 Nathaniel Tomczak, Sanmukh Kuppannagari

Polar Sparsity: High Throughput Batched LLM Inferencing with Scalable Contextual Sparsity

Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters,…

Machine Learning · Computer Science 2025-11-13 Susav Shrestha, Brad Settlemyer, Nikoli Dryden, Narasimha Reddy

Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs

Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding. Prior methods enforce fixed token budgets for sparse attention, assuming a set number of tokens can approximate full…

Machine Learning · Computer Science 2025-02-19 Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci

FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

Large language models (LLMs) encounter computational challenges during long-sequence inference, especially in the attention pre-filling phase, where the complexity grows quadratically with the prompt length. Previous efforts to mitigate…

Machine Learning · Computer Science 2025-03-03 Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou

BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a…

Computation and Language · Computer Science 2026-04-29 Jiayi Yuan, Cameron Shinn, Kai Xu, Jingze Cui, George Klimiashvili, Guangxuan Xiao, Perkz Zheng, Bo Li, Yuxin Zhou, Zhouhai Ye, Weijie You, Tian Zheng, Dominic Brown, Pengbo Wang, Markus Hoehnerbach, Richard Cai, Julien Demouth, John D. Owens, Xia Hu, Song Han, Timmy Liu, Huizi Mao

Exploiting Sparsity for Long Context Inference: Million Token Contexts on Commodity GPUs

There is growing demand for performing inference with hundreds of thousands of input tokens on trained transformer models. Inference at this extreme scale demands significant computational resources, hindering the application of…

Computation and Language · Computer Science 2025-02-13 Ryan Synk, Monte Hoover, John Kirchenbauer, Neel Jain, Alex Stein, Manli Shu, Josue Melendez Sanchez, Ramani Duraiswami, Tom Goldstein

SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity…

Computation and Language · Computer Science 2025-09-04 Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, Chao Yang

Inference Time Context Sparsity: Illusion or Opportunity?

Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift toward longer contexts and agentic interactions, the compute and memory bottlenecks of attention become…

Artificial Intelligence · Computer Science 2026-05-26 Sahil Joshi, Prithvi Dixit, Agniva Chowdhury, Anshumali Shrivastava, Joseph E. Gonzalez, Ion Stoica, Kumar Krishna Agrawal, Aditya Desai

AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention

Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods…

Computation and Language · Computer Science 2026-04-10 Yuxuan Hu, Jianchao Tan, Jiaqi Zhang, Wen Zan, Pingwei Sun, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai, Jing Zhang

HSR-Enhanced Sparse Attention Acceleration

Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their performance on long-context tasks is often limited by the computational complexity of attention mechanisms. We introduce a novel…

Machine Learning · Computer Science 2025-02-25 Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements…

Computation and Language · Computer Science 2024-06-25 Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu