Related papers: Trainable Log-linear Sparse Attention for Efficien…

SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts:…

Machine Learning · Computer Science · 2025-11-20 · Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, Jianfei Chen

VSA: Faster Video Diffusion with Trainable Sparse Attention

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-29 · Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang

Bidirectional Sparse Attention for Faster Video Diffusion Training

Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-01 · Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, Hao Zhang

Long-Context Modeling with Dynamic Hierarchical Sparse Attention for Memory-Constrained LLM Inference

The quadratic cost of attention limits the scalability of long-context LLMs, especially under limited hardware memory budgets. While attention is often sparse, existing static sparse methods cannot adapt to task- or input-dependent…

Computation and Language · Computer Science · 2026-05-29 · Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho

DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation

Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-25 · Jie Hu, Zixiang Gao, Yutong He, Kun Yuan

Dynamic Differential Linear Attention: Enhancing Linear Diffusion Transformer for High-Quality Image Generation

Diffusion transformers (DiTs) have emerged as a powerful architecture for high-fidelity image generation, yet the quadratic cost of self-attention poses a major scalability bottleneck. To address this, linear attention mechanisms have been…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-21 · Boyuan Cao, Xingbo Yao, Chenhui Wang, Jiaxin Ye, Yujie Wei, Hongming Shan

Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences

Efficient Transformers have been developed for long sequence modeling, due to their subquadratic memory and time complexity. Sparse Transformer is a popular approach to improving the efficiency of Transformers by restricting self-attention…

Machine Learning · Computer Science · 2023-02-01 · Aosong Feng, Irene Li, Yuang Jiang, Rex Ying

LoSA: Locality Aware Sparse Attention for Block-Wise Diffusion Language Models

Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they still remain bottlenecked by memory-bound attention in long-context…

Computation and Language · Computer Science · 2026-04-15 · Haocheng Xi, Harman Singh, Yuezhou Hu, Coleman Hooper, Rishabh Tiwari, Aditya Tomar, Minjae Lee, Wonjun Kang, Michael Mahoney, Chenfeng Xu, Kurt Keutzer, Amir Gholami

Transformer Acceleration with Dynamic Sparse Attention

Transformers are the mainstream of NLP applications and are becoming increasingly popular in other domains such as Computer Vision. Despite the improvements in model quality, the enormous computation costs make Transformers difficult at…

Machine Learning · Computer Science · 2021-10-22 · Liu Liu, Zheng Qu, Zhaodong Chen, Yufei Ding, Yuan Xie

Punctuation-aware Hybrid Trainable Sparse Attention for Large Language Models

Attention serves as the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes structurally prohibitive for long sequences due to its quadratic complexity. Consequently, sparse attention…

Computation and Language · Computer Science · 2026-01-07 · Junxiang Qiu, Shuo Wang, Zhengsu Chen, Hengheng Zhang, Jinda Lu, Changcheng Li, Qi Tian

SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or…

Machine Learning · Computer Science · 2026-02-16 · Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving…

Computation and Language · Computer Science · 2025-02-28 · Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng

PISA: Piecewise Sparse Attention Is Wiser for Efficient Diffusion Transformers

Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only critical key-value…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-04 · Haopeng Li, Shitong Shao, Wenliang Zhong, Zikai Zhou, Lichen Bai, Hui Xiong, Zeke Xie

Training-free and Adaptive Sparse Attention for Efficient Long Video Generation

Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency, primarily due to the computational demands of attention mechanisms. For instance, generating an 8-second 720p video (110K…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-03 · Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, Bin Cui

Mixture of Distributions Matters: Dynamic Sparse Attention for Efficient Video Diffusion Transformers

While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-04 · Yuxi Liu, Yipeng Hu, Zekun Zhang, Kunze Jiang, Kun Yuan

Gated Sparse Attention: Combining Computational Efficiency with Training Stability for Long-Context Language Models

The computational burden of attention in long-context language models has motivated two largely independent lines of work: sparse attention mechanisms that reduce complexity by attending to selected tokens, and gated attention variants that…

Artificial Intelligence · Computer Science · 2026-01-23 · Alfred Shen, Aaron Shen

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the…

Computation and Language · Computer Science · 2026-05-19 · Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han, Edoardo M. Ponti, André F. T. Martins, Marcos V. Treviso

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-23 · Songhua Liu, Zhenxiong Tan, Xinchao Wang

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with quadratic…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-28 · Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, Xinggang Wang

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements…

Computation and Language · Computer Science · 2024-06-25 · Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu