Related papers: SparseAccelerate: Efficient Long-Context Inference…

200 papers

In long-context large language model (LLM) inference, the prefill stage dominates computation due to self-attention over the complete input context. Sparse attention significantly reduces self-attention computation by limiting each token's…

Hardware Architecture · Computer Science 2026-02-25 Rakshith Jayanth, Viktor Prasanna

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at…

Computation and Language · Computer Science 2026-05-04 Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim

The computational challenges of Large Language Model (LLM) inference remain a significant barrier to their widespread deployment, especially as prompt lengths continue to increase. Due to the quadratic complexity of the attention…

The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA)…

Machine Learning · Computer Science 2026-04-10 Quantong Qiu, Zhiyi Hong, Yi Yang, Haitian Wang, Kebin Liu, Qingqing Dang, Juntao Li, Min Zhang

The quadratic cost of attention limits the scalability of long-context LLMs, especially under limited hardware memory budgets. While attention is often sparse, existing static sparse methods cannot adapt to task- or input-dependent…

Computation and Language · Computer Science 2026-05-29 Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho

Large Language Models (LLMs) with extended context lengths face significant computational challenges during the pre-filling phase, primarily due to the quadratic complexity of self-attention. Existing methods typically employ dynamic…

Machine Learning · Computer Science 2025-05-30 Yu Zhang, Dong Guo, Fang Wu, Guoliang Zhu, Dian Ding, Yiming Zhang

Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer…

Computation and Language · Computer Science 2025-03-06 Lida Chen, Dong Xu, Chenxin An, Xintao Wang, Yikai Zhang, Jiangjie Chen, Zujie Liang, Feng Wei, Jiaqing Liang, Yanghua Xiao, Wei Wang

Sparse attention methods exploit the inherent sparsity in attention to speed up the prefilling phase of long-context inference, mitigating the quadratic complexity of full attention computation. While existing sparse attention methods rely…

Machine Learning · Computer Science 2025-05-27 Dan Peng, Zhihui Fu, Zewen Ye, Zhuoran Song, Jun Wang

Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent…

Computation and Language · Computer Science 2025-10-22 Siyuan Yan, Guo-Qing Jiang, Yuchen Zhang, Xiaoxing Ma, Ran Zhu, Chun Cao, Jingwei Xu

Transformers have demonstrated great success in numerous domains including natural language processing and bioinformatics. This success stems from the use of the attention mechanism by these models in order to represent and propagate…

Machine Learning · Computer Science 2025-02-10 Nathaniel Tomczak, Sanmukh Kuppannagari

Accelerating large language model (LLM) inference is critical for real-world deployments requiring high throughput and low latency. Contextual sparsity, where each token dynamically activates only a small subset of the model parameters,…

Machine Learning · Computer Science 2025-11-13 Susav Shrestha, Brad Settlemyer, Nikoli Dryden, Narasimha Reddy

Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding. Prior methods enforce fixed token budgets for sparse attention, assuming a set number of tokens can approximate full…

Machine Learning · Computer Science 2025-02-19 Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci

Large language models (LLMs) encounter computational challenges during long-sequence inference, especially in the attention pre-filling phase, where the complexity grows quadratically with the prompt length. Previous efforts to mitigate…

Machine Learning · Computer Science 2025-03-03 Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou

The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a…

There is growing demand for performing inference with hundreds of thousands of input tokens on trained transformer models. Inference at this extreme scale demands significant computational resources, hindering the application of…

Computation and Language · Computer Science 2025-02-13 Ryan Synk, Monte Hoover, John Kirchenbauer, Neel Jain, Alex Stein, Manli Shu, Josue Melendez Sanchez, Ramani Duraiswami, Tom Goldstein

Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity…

Computation and Language · Computer Science 2025-09-04 Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, Chao Yang

Sparsity has long been a central theme in LLM efficiency, but its role in context processing remains unresolved. As LLM workloads shift toward longer contexts and agentic interactions, the compute and memory bottlenecks of attention become…

Long-context inference in LLMs faces the dual challenges of quadratic attention complexity and prohibitive KV cache memory. While token-level sparse attention offers superior accuracy, its indexing overhead is costly; block-level methods…

Computation and Language · Computer Science 2026-04-10 Yuxuan Hu, Jianchao Tan, Jiaqi Zhang, Wen Zan, Pingwei Sun, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai, Jing Zhang

Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their performance on long-context tasks is often limited by the computational complexity of attention mechanisms. We introduce a novel…

Machine Learning · Computer Science 2025-02-25 Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements…

Computation and Language · Computer Science 2024-06-25 Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu