Related papers: FlashBlock: Attention Caching for Efficient Long-C…

BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, Bohan Zhuang

Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models

Diffusion language models offer unique benefits over autoregressive models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work,…

Machine Learning · Computer Science 2025-05-20 Marianne Arriola, Aaron Gokaslan, Justin T. Chiu, Zhihan Yang, Zhixuan Qi, Jiaqi Han, Subham Sekhar Sahoo, Volodymyr Kuleshov

FlashPrefill: Instantaneous Pattern Discovery and Thresholding for Ultra-Fast Long-Context Prefilling

Long-context modeling is a pivotal capability for Large Language Models, yet the quadratic complexity of attention remains a critical bottleneck, particularly during the compute-intensive prefilling phase. While various sparse attention…

Computation and Language · Computer Science 2026-03-09 Qihang Fan, Huaibo Huang, Zhiying Wu, Juqiu Wang, Bingning Wang, Ran He

Cache Me if You Can: Accelerating Diffusion Models through Block Caching

Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly. A…

Computer Vision and Pattern Recognition · Computer Science 2024-01-15 Felix Wimbauer, Bichen Wu, Edgar Schoenfeld, Xiaoliang Dai, Ji Hou, Zijian He, Artsiom Sanakoyeu, Peizhao Zhang, Sam Tsai, Jonas Kohler, Christian Rupprecht, Daniel Cremers, Peter Vajda, Jialiang Wang

A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation

Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent multi-step iterations and complex backbone networks lead to…

Machine Learning · Computer Science 2025-11-04 Jiacheng Liu, Xinyu Wang, Yuqi Lin, Zhikai Wang, Peiru Wang, Peiliang Cai, Qinming Zhou, Zhengan Yan, Zexuan Yan, Zhengyi Shi, Chang Zou, Yue Ma, Linfeng Zhang

FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management

Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of…

Machine Learning · Computer Science 2026-04-21 Nazmul Takbir, Hamidreza Alikhani, Nikil Dutt, Sangeetha Abdu Jyothi

FlexCache: Flexible Approximate Cache System for Video Diffusion

Text-to-Video applications receive increasing attention from the public. Among these, diffusion models have emerged as the most prominent approach, offering impressive quality in visual content generation. However, it still suffers from…

Multimedia · Computer Science 2025-01-09 Desen Sun, Henry Tian, Tim Lu, Sihang Liu

MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM

Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively…

Machine Learning · Computer Science 2026-02-17 Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park, Jae W. Lee

FreqCa: Accelerating Diffusion Models via Frequency-Aware Caching

The application of diffusion transformers is suffering from their significant inference costs. Recently, feature caching has been proposed to solve this problem by reusing features from previous timesteps, thereby skipping computation in…

Machine Learning · Computer Science 2025-10-13 Jiacheng Liu, Peiliang Cai, Qinming Zhou, Yuqi Lin, Deyang Kong, Benhao Huang, Yupei Pan, Haowen Xu, Chang Zou, Junshu Tang, Shikang Zheng, Linfeng Zhang

Block Transformer: Global-to-Local Language Modeling for Fast Inference

We introduce the Block Transformer which adopts hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. Self-attention requires the key-value (KV) cache of…

Computation and Language · Computer Science 2024-11-04 Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun

Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Dvir Samuel, Issar Tzachor, Matan Levy, Micahel Green, Gal Chechik, Rami Ben-Ari

An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

Long-context inference increasingly operates over CPU-resident KV caches, either because decoding-time KV states exceed GPU memory capacity or because disaggregated prefill-decode systems place KV data in host memory. Although block-sparse…

Machine Learning · Computer Science 2026-05-11 Feiyu Yao, Zhixiong Niu, Xiaqing Li, Yongqiang Xiong, Juan Fang, Qian Wang

CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding

Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Shrenik Patel, Daivik Patel

BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-07 Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

LongFlow: Efficient KV Cache Compression for Reasoning Models

Recent reasoning models such as OpenAI-o1 and DeepSeek-R1 have shown strong performance on complex tasks including mathematical reasoning and code generation. However, this performance gain comes with substantially longer output sequences,…

Machine Learning · Computer Science 2026-04-28 Yi Su, Zhenxu Tian, Dan Qiao, Yuechi Zhou, Juntao Li, Min Zhang

Retrospective Sparse Attention for Efficient Long-Context Generation

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory…

Computation and Language · Computer Science 2026-05-21 Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation…

Robotics · Computer Science 2026-05-14 Ruiheng Wang, Shuanghao Bai, Haoran Zhang, Badong Chen, Xiangyu Xu

Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation

We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation…

Computer Vision and Pattern Recognition · Computer Science 2026-04-24 Boxun Xu, Yuming Du, Zichang Liu, Siyu Yang, Ziyang Jiang, Siqi Yan, Rajasi Saha, Albert Pumarola, Wenchen Wang, Peng Li

FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

Diffusion language models offer parallel token generation and inherent bidirectionality, promising more efficient and powerful sequence modeling compared to autoregressive approaches. However, state-of-the-art diffusion models (e.g., Dream…

Computation and Language · Computer Science 2025-10-10 Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta

dKV-Cache: The Cache for Diffusion Language Models

Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive…

Computation and Language · Computer Science 2025-05-22 Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang