Related papers: SpecBlock: Block-Iterative Speculative Decoding wi…

Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios

Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing…

Computation and Language · Computer Science · 2025-11-26 · Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu, Hai Zhao

Speculative Decoding with a Speculative Vocabulary

Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. This relies upon a small draft model, tasked with predicting…

Computation and Language · Computer Science · 2026-02-17 · Miles Williams, Young D. Kwon, Rui Li, Alexandros Kouris, Stylianos I. Venieris

SpecBranch: Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

Recently, speculative decoding (SD) has emerged as a promising technique to accelerate LLM inference by employing a small draft model to propose draft tokens in advance, and validating them in parallel with the large target model. However,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-15 · Yuhao Shen, Junyi Shen, Quan Kong, Tianyu Liu, Yao Lu, Cong Wang

Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference

Speculative decoding accelerates LLM inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size elevates acceptance rates but introduces additional latency…

Computation and Language · Computer Science · 2025-12-15 · Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova

Accelerating Speculative Decoding with Block Diffusion Draft Trees

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an…

Computation and Language · Computer Science · 2026-04-15 · Liran Ringel, Yaniv Romano

Block Verification Accelerates Speculative Decoding

Speculative decoding is an effective method for lossless acceleration of large language models during inference. It uses a fast model to draft a block of tokens which are then verified in parallel by the target model, and provides a…

Machine Learning · Computer Science · 2025-04-14 · Ziteng Sun, Uri Mendlovic, Yaniv Leviathan, Asaf Aharoni, Jae Hun Ro, Ahmad Beirami, Ananda Theertha Suresh

Attention Drift: What Autoregressive Speculative Decoding Models Learn

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call…

Machine Learning · Computer Science · 2026-05-12 · Doğaç Eldenk, Payal Mohapatra, Yigitcan Comlek, Kaan Oktay, Hongyang Zhang, Stephen Xia

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential…

Computation and Language · Computer Science · 2026-05-20 · Yaojie Zhang, Jianuo Huang, Junlong Ke, Yuhang Han, Yongji Long, Tianchen Zhao, Biqing Qi, Linfeng Zhang

DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference

Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference,…

Computation and Language · Computer Science · 2026-01-28 · Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian

SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding

Large language models incur high inference latency due to sequential autoregressive decoding. Speculative decoding alleviates this bottleneck by using a lightweight draft model to propose multiple tokens for batched verification. However,…

Machine Learning · Computer Science · 2026-03-20 · Shenggui Li, Chao Wang, Yikai Zhu, Yubo Wang, Fan Yin, Shuai Shi, Yefei Chen, Xiaomin Dong, Qiaoling Chen, Jin Pan, Ji Li, Laixin Xie, Yineng Zhang, Lei Yu, Yonggang Wen, Ivor Tsang, Tianwei Zhang

SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft…

Computation and Language · Computer Science · 2026-04-15 · Zhuofan Wen, Yang Feng

SpecVLM: Fast Speculative Decoding in Vision-Language Models

Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-23 · Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum

Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs

Speculative decoding is widely used in accelerating large language model (LLM) inference. In this work, we focus on the online draft model selection problem in speculative decoding. We design an algorithm that provably competes with the…

Machine Learning · Computer Science · 2026-04-24 · Hongyi Liu, Jiaji Huang, Zhen Jia, Youngsuk Park, Yu-Xiang Wang

Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which…

Computation and Language · Computer Science · 2025-05-30 · Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu

FastEagle: Cascaded Drafting for Accelerating Speculative Decoding

Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive…

Machine Learning · Computer Science · 2025-09-26 · Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren

SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding

Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive…

Computation and Language · Computer Science · 2025-11-05 · Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, Ferdinando Fioretto

DFlash: Block Diffusion for Flash Speculative Decoding

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast…

Computation and Language · Computer Science · 2026-05-29 · Jian Chen, Yesheng Liang, Zhijian Liu

Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time…

Computation and Language · Computer Science · 2026-03-03 · Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li

SpecPipe: Accelerating Pipeline Parallelism-based LLM Inference with Speculative Decoding

The demand for large language model inference is rapidly increasing. Pipeline parallelism offers a cost-effective deployment strategy for distributed inference but suffers from high service latency. While incorporating speculative decoding…

Machine Learning · Computer Science · 2025-09-01 · Haofei Yin, Mengbai Xiao, Tinghong Li, Xiao Zhang, Dongxiao Yu, Guanghui Zhang

TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured…

Computation and Language · Computer Science · 2026-01-13 · Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, Xiaoyan Sun