Related papers: Trainable Log-linear Sparse Attention for Efficien…

200 papers

In Diffusion Transformer (DiT) models, particularly for video generation, attention latency is a major bottleneck due to the long sequence length and the quadratic complexity. We find that attention weights can be separated into two parts:…

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient…

Computer Vision and Pattern Recognition · Computer Science 2025-10-29 Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang

Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high…

Computer Vision and Pattern Recognition · Computer Science 2026-01-01 Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, Hao Zhang

The quadratic cost of attention limits the scalability of long-context LLMs, especially under limited hardware memory budgets. While attention is often sparse, existing static sparse methods cannot adapt to task- or input-dependent…

Computation and Language · Computer Science 2026-05-29 Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho

Diffusion transformers have achieved remarkable success in high-quality video generation, yet their reliance on spatiotemporal 3D full attention incurs prohibitive computational cost due to the quadratic complexity of attention. Block…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Jie Hu, Zixiang Gao, Yutong He, Kun Yuan

Diffusion transformers (DiTs) have emerged as a powerful architecture for high-fidelity image generation, yet the quadratic cost of self-attention poses a major scalability bottleneck. To address this, linear attention mechanisms have been…

Computer Vision and Pattern Recognition · Computer Science 2026-01-21 Boyuan Cao, Xingbo Yao, Chenhui Wang, Jiaxin Ye, Yujie Wei, Hongming Shan

Efficient Transformers have been developed for long sequence modeling, due to their subquadratic memory and time complexity. Sparse Transformer is a popular approach to improving the efficiency of Transformers by restricting self-attention…

Machine Learning · Computer Science 2023-02-01 Aosong Feng, Irene Li, Yuang Jiang, Rex Ying

Block-wise diffusion language models (DLMs) generate multiple tokens in any order, offering a promising alternative to the autoregressive decoding pipeline. However, they still remain bottlenecked by memory-bound attention in long-context…

Transformers are the mainstream of NLP applications and are becoming increasingly popular in other domains such as Computer Vision. Despite the improvements in model quality, the enormous computation costs make Transformers difficult at…

Machine Learning · Computer Science 2021-10-22 Liu Liu, Zheng Qu, Zhaodong Chen, Yufei Ding, Yuan Xie

Attention serves as the fundamental mechanism for long-context modeling in large language models (LLMs), yet dense attention becomes structurally prohibitive for long sequences due to its quadratic complexity. Consequently, sparse attention…

Computation and Language · Computer Science 2026-01-07 Junxiang Qiu, Shuo Wang, Zhengsu Chen, Hengheng Zhang, Jinda Lu, Changcheng Li, Qi Tian

Sparse-Linear Attention (SLA) combines sparse and linear attention to accelerate diffusion models and has shown strong performance in video generation. However, (i) SLA relies on a heuristic split that assigns computations to the sparse or…

Machine Learning · Computer Science 2026-02-16 Jintao Zhang, Haoxu Wang, Kai Jiang, Kaiwen Zheng, Youhe Jiang, Ion Stoica, Jianfei Chen, Jun Zhu, Joseph E. Gonzalez

Long-context modeling is crucial for next-generation language models, yet the high computational cost of standard attention mechanisms poses significant computational challenges. Sparse attention offers a promising direction for improving…

Diffusion Transformers are fundamental for video and image generation, but their efficiency is bottlenecked by the quadratic complexity of attention. While block sparse attention accelerates computation by attending only critical key-value…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Haopeng Li, Shitong Shao, Wenliang Zhong, Zikai Zhou, Lichen Bai, Hui Xiong, Zeke Xie

Generating high-fidelity long videos with Diffusion Transformers (DiTs) is often hindered by significant latency, primarily due to the computational demands of attention mechanisms. For instance, generating an 8-second 720p video (110K…

Computer Vision and Pattern Recognition · Computer Science 2025-03-03 Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, Bin Cui

While Diffusion Transformers (DiTs) have achieved notable progress in video generation, this long-sequence generation task remains constrained by the quadratic complexity inherent to self-attention mechanisms, creating significant barriers…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Yuxi Liu, Yipeng Hu, Zekun Zhang, Kunze Jiang, Kun Yuan

The computational burden of attention in long-context language models has motivated two largely independent lines of work: sparse attention mechanisms that reduce complexity by attending to selected tokens, and gated attention variants that…

Artificial Intelligence · Computer Science 2026-01-23 Alfred Shen, Aaron Shen

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the…

Computation and Language · Computer Science 2026-05-19 Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han, Edoardo M. Ponti, André F. T. Martins, Marcos V. Treviso

Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when…

Computer Vision and Pattern Recognition · Computer Science 2024-12-23 Songhua Liu, Zhenxiong Tan, Xinchao Wang

Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with quadratic…

Computer Vision and Pattern Recognition · Computer Science 2024-11-28 Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, Xinggang Wang

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements…

Computation and Language · Computer Science 2024-06-25 Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu