Related papers: NPU Design for Diffusion Language Model Inference

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked…

Machine Learning · Computer Science 2025-06-10 Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyuan Wei, Shaobo Wang, Linfeng Zhang

Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source…

Machine Learning · Computer Science 2025-08-14 Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng

d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard…

Computation and Language · Computer Science 2026-02-17 Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang

BlockVLA: Accelerating Autoregressive VLA via Block Diffusion Finetuning

While autoregressive (AR) Vision-Language-Action (VLA) models have demonstrated formidable reasoning capabilities in robotic tasks, their sequential decoding process often incurs high inference latency and may amplify error accumulation…

Robotics · Computer Science 2026-05-14 Ruiheng Wang, Shuanghao Bai, Haoran Zhang, Badong Chen, Xiangyu Xu

Fast-dLLM v2: Efficient Block-Diffusion LLM

Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency. In this work, we propose Fast-dLLM v2,…

Computation and Language · Computer Science 2025-10-01 Chengyue Wu, Hao Zhang, Shuchen Xue, Shizhe Diao, Yonggan Fu, Zhijian Liu, Pavlo Molchanov, Ping Luo, Song Han, Enze Xie

dInfer: An Efficient Inference Framework for Diffusion Language Models

Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Even more and more open-sourced dLLM models emerge, yet…

Computation and Language · Computer Science 2025-10-23 Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng

Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics…

Computation and Language · Computer Science 2026-04-13 Chengyue Wu, Shiyi Lan, Yonggan Fu, Sensen Gao, Jin Wang, Jincheng Yu, Jose M. Alvarez, Pavlo Molchanov, Ping Luo, Song Han, Ligeng Zhu, Enze Xie

DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

Diffusion-based decoding has recently emerged as an appealing alternative to autoregressive (AR) generation, offering the potential to update multiple tokens in parallel and reduce latency. However, diffusion vision language models (dVLMs)…

Computer Vision and Pattern Recognition · Computer Science 2026-04-01 Lunbin Zeng, Jingfeng Yao, Bencheng Liao, Hongyuan Tao, Wenyu Liu, Xinggang Wang

dKV-Cache: The Cache for Diffusion Language Models

Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive…

Computation and Language · Computer Science 2025-05-22 Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang

d3LLM: Ultra-Fast Diffusion LLM using Pseudo-Trajectory Distillation

Diffusion large language models (dLLMs) offer capabilities beyond those of autoregressive (AR) LLMs, such as parallel decoding and random-order generation. However, realizing these benefits in practice is non-trivial, as dLLMs inherently…

Machine Learning · Computer Science 2026-01-30 Yu-Yang Qian, Junda Su, Lanxiang Hu, Peiyuan Zhang, Zhijie Deng, Peng Zhao, Hao Zhang

Fast On-device LLM Inference with NPUs

On-device inference for Large Language Models (LLMs), driven by increasing privacy concerns and advancements of mobile-sized models, has gained significant interest. However, even mobile-sized LLMs (e.g., Gemma-2B) encounter unacceptably…

Artificial Intelligence · Computer Science 2024-12-17 Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu

A Survey on Diffusion Language Models

Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent…

Computation and Language · Computer Science 2025-12-08 Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen

Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models

Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks, including document processing and code generation. Autoregressive Language Models (ARMs), which generate…

Machine Learning · Computer Science 2025-12-16 Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Sequential Diffusion Language Models

Diffusion language models (DLMs) have strong theoretical efficiency but are limited by fixed-length decoding and incompatibility with key-value (KV) caches. Block diffusion mitigates these issues, yet still enforces a fixed block size and…

Computation and Language · Computer Science 2025-09-30 Yangzhou Liu, Yue Cao, Hao Li, Gen Luo, Zhe Chen, Weiyun Wang, Xiaobo Liang, Biqing Qi, Lijun Wu, Changyao Tian, Yanting Zhang, Yuqiang Li, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang

From Next-Token to Next-Block: A Principled Adaptation Path for Diffusion LLMs

Diffusion Language Models (DLMs) enable fast generation, yet training large DLMs from scratch is costly. As a practical shortcut, adapting off-the-shelf Auto-Regressive (AR) model weights into a DLM could quickly equip the DLM with strong…

Computation and Language · Computer Science 2026-02-02 Yuchuan Tian, Yuchen Liang, Shuo Zhang, Yingte Shu, Guangwen Yang, Wei He, Sibo Fang, Tianyu Guo, Kai Han, Chao Xu, Hanting Chen, Xinghao Chen, Yunhe Wang

A Comparative analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs

Autoregressive (AR) language models build representations incrementally via left-to-right prediction, while diffusion language models (dLLMs) are trained through full-sequence denoising. Although recent dLLMs match AR performance, whether…

Computation and Language · Computer Science 2026-05-11 Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee, Fatih Porikli

Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding

Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. However, the practical inference speed of open-sourced Diffusion LLMs often lags behind…

Computation and Language · Computer Science 2025-07-04 Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie

Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction

Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive quadratic computational complexity and memory overhead during inference. Current caching techniques accelerate…

Computation and Language · Computer Science 2025-11-06 Yuerong Song, Xiaoran Liu, Ruixiao Li, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu

Lost in Diffusion: Uncovering Hallucination Patterns and Failure Modes in Diffusion Large Language Models

While Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive paradigm comparable to autoregressive (AR) models, their faithfulness, specifically regarding hallucination, remains largely underexplored. To…

Computation and Language · Computer Science 2026-04-14 Zhengnan Guo, Fei Tan

CDLM: Consistency Diffusion Language Models For Faster Sampling

Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language…

Machine Learning · Computer Science 2026-02-23 Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami