Related papers: SPLAT: A framework for optimised GPU code-generati…

Understanding Long Programming Languages with Structure-Aware Sparse Attention

Programming-based Pre-trained Language Models (PPLMs) such as CodeBERT have achieved great success in many downstream code-related tasks. Since the memory and computational complexity of self-attention in the Transformer grow quadratically…

Computation and Language · Computer Science 2022-05-30 Tingting Liu , Chengyu Wang , Cen Chen , Ming Gao , Aoying Zhou

BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a…

Computation and Language · Computer Science 2026-04-29 Jiayi Yuan , Cameron Shinn , Kai Xu , Jingze Cui , George Klimiashvili , Guangxuan Xiao , Perkz Zheng , Bo Li , Yuxin Zhou , Zhouhai Ye , Weijie You , Tian Zheng , Dominic Brown , Pengbo Wang , Markus Hoehnerbach , Richard Cai , Julien Demouth , John D. Owens , Xia Hu , Song Han , Timmy Liu , Huizi Mao

SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or…

Machine Learning · Computer Science 2026-02-16 Jintao Zhang , Haoxu Wang , Kai Jiang , Kaiwen Zheng , Youhe Jiang , Ion Stoica , Jianfei Chen , Jun Zhu , Joseph E. Gonzalez

Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models

Attention serves as the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes structurally prohibitive for long sequences due to its quadratic complexity. Consequently, sparse attention…

Computation and Language · Computer Science 2026-01-07 Junxiang Qiu , Shuo Wang , Zhengsu Chen , Hengheng Zhang , Jinda Lu , Changcheng Li , Qi Tian

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving…

Computation and Language · Computer Science 2025-02-28 Jingyang Yuan , Huazuo Gao , Damai Dai , Junyu Luo , Liang Zhao , Zhengyan Zhang , Zhenda Xie , Y. X. Wei , Lean Wang , Zhiping Xiao , Yuqing Wang , Chong Ruan , Ming Zhang , Wenfeng Liang , Wangding Zeng

Long-Context Modeling with Dynamic Hierarchical Sparse Attention for Memory-Constrained LLM Inference

The quadratic cost of attention limits the scalability of long-context LLMs, especially under limited hardware memory budgets. While attention is often sparse, existing static sparse methods cannot adapt to task- or input-dependent…

Computation and Language · Computer Science 2026-05-29 Siheng Xiong , Joe Zou , Faramarz Fekri , Yae Jee Cho

Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA)…

Machine Learning · Computer Science 2026-04-10 Quantong Qiu , Zhiyi Hong , Yi Yang , Haitian Wang , Kebin Liu , Qingqing Dang , Juntao Li , Min Zhang

Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of O(H N^2) that grows quadratically…

Machine Learning · Computer Science 2025-12-01 Mingkuan Zhao , Wentao Hu , Jiayin Wang , Xin Lai , Tianchen Huang , Yuheng Min , Rui Yan , Xiaoyan Zhu

Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models

The quadratic computational complexity of Multi-Head Self-Attention (MHSA) remains a fundamental bottleneck in scaling Large Language Models (LLMs) for long-context tasks. While sparse and linearized attention mechanisms attempt to mitigate…

Computation and Language · Computer Science 2025-12-19 Caner Erden

Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation

The computational demands of self-attention mechanisms pose a critical challenge for transformer-based video generation, particularly in synthesizing ultra-long sequences. Current approaches, such as factorized attention and fixed sparse…

Computer Vision and Pattern Recognition · Computer Science 2025-08-19 Qirui Li , Guangcong Zheng , Qi Zhao , Jie Li , Bin Dong , Yiwu Yao , Xi Li

SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts:…

Machine Learning · Computer Science 2025-11-20 Jintao Zhang , Haoxu Wang , Kai Jiang , Shuo Yang , Kaiwen Zheng , Haocheng Xi , Ziteng Wang , Hongzhou Zhu , Min Zhao , Ion Stoica , Joseph E. Gonzalez , Jun Zhu , Jianfei Chen

FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

Recent advances in sparse attention mechanisms have demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), one state-of-the-art…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-14 Ran Yan , Youhe Jiang , Zhuoming Chen , Haohui Mai , Beidi Chen , Binhang Yuan

MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of…

Computation and Language · Computer Science 2026-04-14 Yu Chen , Runkai Chen , Sheng Yi , Xinda Zhao , Xiaohong Li , Jianjin Zhang , Jun Sun , Chuanrui Hu , Yunyun Han , Lidong Bing , Yafeng Deng , Tianqiao Chen

SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs

As Large Language Models (LLMs) scale to longer context windows, the computational cost of attention mechanisms, which traditionally grows quadratically with input length, presents a critical challenge for real-time and memory-constrained…

Computation and Language · Computer Science 2024-12-10 James Vo

AST-MHSA : Code Summarization using Multi-Head Self-Attention

Code summarization aims to generate concise natural language descriptions for source code. The prevailing approaches adopt transformer-based encoder-decoder architectures, where the Abstract Syntax Tree (AST) of the source code is utilized…

Computation and Language · Computer Science 2023-08-11 Yeshwanth Nagaraj , Ujjwal Gupta

HRSAM: Efficient Interactive Segmentation in High-Resolution Images

The Segment Anything Model (SAM) has advanced interactive segmentation but is limited by the high computational cost on high-resolution images. This requires downsampling to meet GPU constraints, sacrificing the fine-grained details needed…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 You Huang , Wenbin Lai , Jiayi Ji , Liujuan Cao , Shengchuan Zhang , Rongrong Ji

SPAT: Sensitivity-based Multihead-attention Pruning on Time Series Forecasting Models

Attention-based architectures have achieved superior performance in multivariate time series forecasting but are computationally expensive. Techniques such as patching and adaptive masking have been developed to reduce their sizes and…

Machine Learning · Computer Science 2025-05-14 Suhan Guo , Jiahong Deng , Mengjun Yi , Furao Shen , Jian Zhao

Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving

Processing long contexts has become a critical capability for modern large language models (LLMs). However, serving long-context LLMs comes with significant inference costs due to the high memory overhead of the key-value (KV) cache.…

Machine Learning · Computer Science 2025-03-04 Qihui Zhou , Peiqi Yin , Pengfei Zuo , James Cheng

Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs

Dense large language models (LLMs) face critical efficiency bottlenecks as they rigidly activate all parameters regardless of input complexity. While existing sparsity methods (static pruning or dynamic activation) address this partially,…

Computation and Language · Computer Science 2025-02-27 Yiheng Yang , Yujie Wang , Chi Ma , Lei Yu , Emmanuele Chersoni , Chu-Ren Huang

A Unified Sparse Attention via Multi-Granularity Compression

Efficient long-context understanding and reasoning are increasingly vital for large language model (LLM) applications such as multi-turn dialogue and program analysis. However, the core self-attention mechanism scales quadratically with…

Computation and Language · Computer Science 2025-12-17 Siran Liu , Zane Cao , Yongchao He