Related papers: DFlash: Block Diffusion for Flash Speculative Deco…

Accelerating Speculative Decoding with Block Diffusion Draft Trees

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an…

Computation and Language · Computer Science 2026-04-15 Liran Ringel, Yaniv Romano

Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a…

Machine Learning · Computer Science 2026-01-29 Rui Pan, Zhuofu Chen, Hongyi Liu, Arvind Krishnamurthy, Ravi Netravali

SpecFLASH: A Latent-Guided Semi-autoregressive Speculative Decoding Framework for Efficient Multimodal Generation

Large language models and large multimodal models (LLMs and LMMs) deliver strong generative performance but suffer from slow decoding, a problem that becomes more severe when handling visual inputs, whose sequences typically contain many…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Zihua Wang, Ruibo Li, Haozhe Du, Joey Tianyi Zhou, Yu Zhang, Xu Yang

Self Speculative Decoding for Diffusion Large Language Models

Diffusion-based Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive models, offering unique advantages through bidirectional attention and parallel generation paradigms. However, the generation results…

Computation and Language · Computer Science 2025-10-07 Yifeng Gao, Ziang Ji, Yuxuan Wang, Biqing Qi, Hanlin Xu, Linfeng Zhang

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto

SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding

Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive…

Computation and Language · Computer Science 2025-11-05 Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, Ferdinando Fioretto

Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science 2024-11-28 Hyun Ryu, Eric Kim

DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models

Masked Diffusion Models (MDMs) offer a promising alternative to autoregressive language models by enabling parallel token generation and bidirectional context modeling. However, their inference speed is significantly limited by the…

Machine Learning · Computer Science 2026-04-08 Satyam Goyal, Kushal Patel, Tanush Mittal, Arjun Laxman

DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding

As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter…

Computation and Language · Computer Science 2025-10-06 Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, Jun Wang

DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising…

Computation and Language · Computer Science 2025-03-04 Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu

Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding

Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often…

Machine Learning · Computer Science 2026-01-15 Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Mingu Lee, Christopher Lott, Fatih Porikli

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

Diffusion LLMs Can Do Faster-Than-AR Inference via Discrete Diffusion Forcing

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source…

Machine Learning · Computer Science 2025-08-14 Xu Wang, Chenkai Xu, Yijie Jin, Jiachun Jin, Hao Zhang, Zhijie Deng

Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios

Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing…

Computation and Language · Computer Science 2025-11-26 Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu, Hai Zhao

DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference

Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference,…

Computation and Language · Computer Science 2026-01-28 Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian

Decoding Speculative Decoding

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

Machine Learning · Computer Science 2025-02-06 Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman

Automatic Task Detection and Heterogeneous LLM Speculative Decoding

Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate…

Computation and Language · Computer Science 2025-05-14 Danying Ge, Jianhua Gao, Qizhi Jiang, Yifei Feng, Weixing Ji

DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive language generation due to their potential for parallel decoding and global refinement of the entire sequence. To unlock this potential, DLM…

Machine Learning · Computer Science 2026-04-20 Xiang Xia, Wuyang Zhang, Jiazheng Liu, Cheng Yan, Yanyong Zhang

Speculative Streaming: Fast LLM Inference without Auxiliary Models

Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both…

Computation and Language · Computer Science 2024-02-20 Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass.…

Computation and Language · Computer Science 2025-06-12 Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel