English
Related papers

Related papers: FlashBlock: Attention Caching for Efficient Long-C…

200 papers

Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Zeyu Zhang , Shuning Chang , Yuanyu He , Yizeng Han , Jiasheng Tang , Fan Wang , Bohan Zhuang

Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work,…

Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention…

Computation and Language · Computer Science 2026-03-09 Qihang Fan , Huaibo Huang , Zhiying Wu , Juqiu Wang , Bingning Wang , Ran He

Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A…

Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent \textit{multi-step iterations} and \textit{complex backbone networks} lead to…

Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of…

Machine Learning · Computer Science 2026-04-21 Nazmul Takbir , Hamidreza Alikhani , Nikil Dutt , Sangeetha Abdu Jyothi

Text-to-Video applications receive increasing attention from the public. Among these, diffusion models have emerged as the most prominent approach, offering impressive quality in visual content generation. However, it still suffers from…

Multimedia · Computer Science 2025-01-09 Desen Sun , Henry Tian , Tim Lu , Sihang Liu

Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively…

Machine Learning · Computer Science 2026-02-17 Omin Kwon , Yeonjae Kim , Doyeon Kim , Minseo Kim , Yeonhong Park , Jae W. Lee

The application of diffusion transformers is suffering from their significant inference costs. Recently, feature caching has been proposed to solve this problem by reusing features from previous timesteps, thereby skipping computation in…

We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of…

Computation and Language · Computer Science 2024-11-04 Namgyu Ho , Sangmin Bae , Taehyeon Kim , Hyunjik Jo , Yireun Kim , Tal Schuster , Adam Fisch , James Thorne , Se-Young Yun

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Dvir Samuel , Issar Tzachor , Matan Levy , Micahel Green , Gal Chechik , Rami Ben-Ari

Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse…

Machine Learning · Computer Science 2026-05-11 Feiyu Yao , Zhixiong Niu , Xiaqing Li , Yongqiang Xiong , Juan Fang , Qian Wang

Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Shrenik Patel , Daivik Patel

Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-07 Ao Sun , Weilin Zhao , Xu Han , Cheng Yang , Zhiyuan Liu , Chuan Shi , Maosong Sun

Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences,…

Machine Learning · Computer Science 2026-04-28 Yi Su , Zhenxu Tian , Dan Qiao , Yuechi Zhou , Juntao Li , Min Zhang

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory…

Computation and Language · Computer Science 2026-05-21 Seonghwan Choi , Beomseok Kang , Dongwon Jo , Jae-Joon Kim

While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation…

Robotics · Computer Science 2026-05-14 Ruiheng Wang , Shuanghao Bai , Haoran Zhang , Badong Chen , Xiangyu Xu

We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation…

Computer Vision and Pattern Recognition · Computer Science 2026-04-24 Boxun Xu , Yuming Du , Zichang Liu , Siyu Yang , Ziyang Jiang , Siqi Yan , Rajasi Saha , Albert Pumarola , Wenchen Wang , Peng Li

Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream…

Computation and Language · Computer Science 2025-10-10 Zhanqiu Hu , Jian Meng , Yash Akhauri , Mohamed S. Abdelfattah , Jae-sun Seo , Zhiru Zhang , Udit Gupta

Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive…

Computation and Language · Computer Science 2025-05-22 Xinyin Ma , Runpeng Yu , Gongfan Fang , Xinchao Wang
‹ Prev 1 2 3 10 Next ›