Related papers: Accelerating Speculative Decoding with Block Diffu…

DFlash: Block Diffusion for Flash Speculative Decoding

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast…

Computation and Language · Computer Science 2026-05-29 Jian Chen, Yesheng Liang, Zhijian Liu

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3…

Computation and Language · Computer Science 2026-05-22 Weijie Shi, Qiang Xu, Fan Deng, Yaguang Wu, Jiarun Liu, Yehong Xu, Hao Chen, Jia Zhu, Jiajie Xu, Xiangjun Huang, Jian Yang, Xiaofang Zhou

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto

DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference

Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference,…

Computation and Language · Computer Science 2026-01-28 Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian

DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding

As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter…

Computation and Language · Computer Science 2025-10-06 Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, Jun Wang

Speculative Decoding with a Speculative Vocabulary

Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. This relies upon a small draft model, tasked with predicting…

Computation and Language · Computer Science 2026-02-17 Miles Williams, Young D. Kwon, Rui Li, Alexandros Kouris, Stylianos I. Venieris

OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure

Autoregressive language models demonstrate excellent performance in various scenarios. However, the inference efficiency is limited by its one-step-one-word generation mode, which has become a pressing problem recently as the models become…

Computation and Language · Computer Science 2025-04-25 Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Min Zhang

D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward…

Machine Learning · Computer Science 2026-05-20 Tianyu Wu, Yu Yao, Zhenting Qi, Han Zheng, Zhuohan Wang, Haoran Ma, Lawrence Liao, Himabindu Lakkaraju, Ju Li, Yilun Du

Self Speculative Decoding for Diffusion Large Language Models

Diffusion-based Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive models, offering unique advantages through bidirectional attention and parallel generation paradigms. However, the generation results…

Computation and Language · Computer Science 2025-10-07 Yifeng Gao, Ziang Ji, Yuxuan Wang, Biqing Qi, Hanlin Xu, Linfeng Zhang

Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding

Multi-path speculative decoding accelerates lossless sampling from a target model by using a cheaper draft model to generate a draft tree of tokens, and then applies a verification algorithm that accepts a subset of these. While prior work…

Machine Learning · Computer Science 2026-02-20 Rahul Thomas, Teo Kitanovski, Micah Goldblum, Arka Pal

Speculative Decoding for Autoregressive Video Generation

Autoregressive video diffusion is emerging as a promising paradigm for streaming video synthesis, with step distillation serving as the primary means of accelerating inference. Whether speculative decoding, the dominant acceleration…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Yuezhou Hu, Jintao Zhang

Traversal Verification for Speculative Tree Decoding

Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in…

Computation and Language · Computer Science 2025-11-06 Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi

DEER: Draft with Diffusion, Verify with Autoregressive Models

Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify…

Machine Learning · Computer Science 2025-12-18 Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu

Draft-OPD: On-Policy Distillation for Speculative Draft Models

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, like EAGLE3 or DFlash, is supervised…

Computation and Language · Computer Science 2026-05-29 Haodi Lei, Yafy Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng

FastEagle: Cascaded Drafting for Accelerating Speculative Decoding

Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive…

Machine Learning · Computer Science 2025-09-26 Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren

Dynamic Depth Decoding: Faster Speculative Decoding for LLMs

The acceleration of Large Language Models (LLMs) with speculative decoding provides a significant runtime improvement without any loss of accuracy. Currently, EAGLE-2 is the state-of-the-art speculative decoding method, improving on EAGLE…

Computation and Language · Computer Science 2024-09-04 Oscar Brown, Zhengjie Wang, Andrea Do, Nikhil Mathew, Cheng Yu

Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a…

Machine Learning · Computer Science 2026-01-29 Rui Pan, Zhuofu Chen, Hongyi Liu, Arvind Krishnamurthy, Ravi Netravali

SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding

Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive…

Computation and Language · Computer Science 2025-11-05 Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, Ferdinando Fioretto

DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models

Masked Diffusion Models (MDMs) offer a promising alternative to autoregressive language models by enabling parallel token generation and bidirectional context modeling. However, their inference speed is significantly limited by the…

Machine Learning · Computer Science 2026-04-08 Satyam Goyal, Kushal Patel, Tanush Mittal, Arjun Laxman

DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure

While speculative decoding has recently appeared as a promising direction for accelerating the inference of large language models (LLMs), the speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods…

Machine Learning · Computer Science 2024-10-16 Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou