Related papers: SparseCoder: Advancing Source Code Analysis with S…

SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization

Code summarization aims to generate natural language descriptions of source code, helping programmers understand and maintain it rapidly. While previous code summarization efforts have predominantly focused on the method level, this…

Software Engineering · Computer Science 2024-01-29 Yanlin Wang, Yanxian Huang, Daya Guo, Hongyu Zhang, Zibin Zheng

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements…

Computation and Language · Computer Science 2024-06-25 Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu

SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs

As Large Language Models (LLMs) scale to longer context windows, the computational cost of attention mechanisms, which traditionally grows quadratically with input length, presents a critical challenge for real-time and memory-constrained…

Computation and Language · Computer Science 2024-12-10 James Vo

Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to…

Machine Learning · Computer Science 2025-12-02 Yilong Zhao, Jiaming Tang, Kan Zhu, Zihao Ye, Chi-Chih Chang, Chaofan Lin, Jongseok Park, Guangxuan Xiao, Mohamed S. Abdelfattah, Mingyu Gao, Baris Kasikci, Song Han, Ion Stoica

Faster Causal Attention Over Large Sequences Through Sparse Flash Attention

Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention -- which is the only component scaling quadratically w.r.t.…

Machine Learning · Computer Science 2023-06-05 Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret

Understanding Long Programming Languages with Structure-Aware Sparse Attention

Programming-based Pre-trained Language Models (PPLMs) such as CodeBERT have achieved great success in many downstream code-related tasks. Since the memory and computational complexity of self-attention in the Transformer grow quadratically…

Computation and Language · Computer Science 2022-05-30 Tingting Liu, Chengyu Wang, Cen Chen, Ming Gao, Aoying Zhou

Learned Token Pruning for Transformers

Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes…

Computation and Language · Computer Science 2022-06-06 Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at…

Computation and Language · Computer Science 2026-05-04 Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim

Unleashing the Potential of Sparse Attention on Long-term Behaviors for CTR Prediction

In recent years, the success of large language models (LLMs) has driven the exploration of scaling laws in recommender systems. However, models that demonstrate scaling laws are actually challenging to deploy in industrial settings for…

Information Retrieval · Computer Science 2026-01-27 Weijiang Lai, Beihong Jin, Di Zhang, Siru Chen, Jiongyan Zhang, Yuhang Gou, Jian Dong, Xingxing Wang

Predicting Attention Sparsity in Transformers

Transformers' quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact…

Computation and Language · Computer Science 2022-04-22 Marcos Treviso, António Góis, Patrick Fernandes, Erick Fonseca, André F. T. Martins

TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Large language models (LLMs) have driven significant advancements across diverse NLP tasks, with long-context models gaining prominence for handling extended inputs. However, the expanding key-value (KV) cache size required by Transformer…

Machine Learning · Computer Science 2024-10-08 Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia

Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer

The Transformer has achieved great success in NLP. However, the quadratic complexity of the self-attention mechanism in the Transformer makes it inefficient in handling long sequences. Many existing works explore accelerating Transformers by…

Computation and Language · Computer Science 2021-09-03 Chuhan Wu, Fangzhao Wu, Tao Qi, Binxing Jiao, Daxin Jiang, Yongfeng Huang, Xing Xie

A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining

Transformers have been considered among the most important deep learning models since 2018, in part because they establish state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite the remarkable…

Machine Learning · Computer Science 2022-08-23 Hongwu Peng, Shaoyi Huang, Shiyang Chen, Bingbing Li, Tong Geng, Ang Li, Weiwen Jiang, Wujie Wen, Jinbo Bi, Hang Liu, Caiwen Ding

SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning

The attention mechanism is becoming increasingly popular in Natural Language Processing (NLP) applications, showing better performance than convolutional and recurrent architectures. However, attention becomes the computation bottleneck…

Hardware Architecture · Computer Science 2024-07-22 Hanrui Wang, Zhekai Zhang, Song Han

Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques

Transformers have demonstrated great success in numerous domains including natural language processing and bioinformatics. This success stems from the use of the attention mechanism by these models in order to represent and propagate…

Machine Learning · Computer Science 2025-02-10 Nathaniel Tomczak, Sanmukh Kuppannagari

Sparser Block-Sparse Attention via Token Permutation

Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence…

Computation and Language · Computer Science 2026-05-25 Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with the…

Computation and Language · Computer Science 2026-01-28 Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti

In-Context Compositional Learning via Sparse Coding Transformer

Transformer architectures have achieved remarkable success across language, vision, and multimodal tasks, and there is growing demand for them to address in-context compositional learning tasks. In these tasks, models solve the target…

Machine Learning · Computer Science 2025-11-26 Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu

Sparse Attention-Based Neural Networks for Code Classification

Categorizing source code accurately and efficiently is a challenging problem in real-world programming education platform management. In recent years, model-based approaches utilizing abstract syntax trees (ASTs) have been widely applied…

Programming Languages · Computer Science 2023-11-14 Ziyang Xiang, Zaixi Zhang, Qi Liu

Scaling Attention via Feature Sparsity

Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these…

Machine Learning · Computer Science 2026-03-31 Yan Xie, Tiansheng Wen, Tangda Huang, Bo Chen, Chenyu You, Stefanie Jegelka, Yifei Wang