Related papers: OPT-Tree: Speculative Decoding with Adaptive Draft…

SpecTr: Fast Speculative Decoding via Optimal Transport

Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, making it slow, and even prohibitive in certain tasks.…

Machine Learning · Computer Science 2024-01-19 Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu

Dynamic Delayed Tree Expansion For Improved Multi-Path Speculative Decoding

Multi-path speculative decoding accelerates lossless sampling from a target model by using a cheaper draft model to generate a draft tree of tokens, and then applies a verification algorithm that accepts a subset of these. While prior work…

Machine Learning · Computer Science 2026-02-20 Rahul Thomas, Teo Kitanovski, Micah Goldblum, Arka Pal

Traversal Verification for Speculative Tree Decoding

Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in…

Computation and Language · Computer Science 2025-11-06 Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi

UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding

Speculative decoding accelerates Large Language Models via draft-then-verify, where verification can be framed as an Optimal Transport (OT) problem. Existing approaches typically handle multi-draft and multi-step aspects in isolation,…

Computation and Language · Computer Science 2026-05-07 Yepeng Weng, Qiao Hu, Takehisa Yairi

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM…

Machine Learning · Computer Science 2026-05-20 Yuhao Shen, Tianyu Liu, Xinyi Hu, Quan Kong, Baolin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan, Cong Wang

Fast Inference of Visual Autoregressive Model with Adjacency-Adaptive Dynamical Draft Trees

Autoregressive (AR) image models achieve diffusion-level quality but suffer from sequential inference, requiring approximately 2,000 steps for a 576x576 image. Speculative decoding with draft trees accelerates LLMs yet underperforms on…

Computer Vision and Pattern Recognition · Computer Science 2025-12-29 Haodong Lei, Hongsong Wang, Xin Geng, Liang Wang, Pan Zhou

Confidence-Modulated Speculative Decoding for Large Language Models

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid…

Computation and Language · Computer Science 2026-05-29 Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured…

Computation and Language · Computer Science 2026-01-13 Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, Xiaoyan Sun

When Drafts Evolve: Speculative Decoding Meets Online Learning

Speculative decoding has emerged as a widely adopted paradigm for accelerating large language model inference, where a lightweight draft model rapidly generates candidate tokens that are then verified in parallel by a larger target model.…

Machine Learning · Computer Science 2026-03-16 Yu-Yang Qian, Hao-Cong Wu, Yichao Fu, Hao Zhang, Peng Zhao

Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding

Tree-based speculative decoding accelerates autoregressive generation by verifying multiple draft candidates in parallel, but this advantage weakens for sparse Mixture-of-Experts (MoE) models. As the draft tree grows, different branches…

Computation and Language · Computer Science 2026-05-04 Lehan Pan, Ziyang Tao, Ruoyu Pang, Xiao Wang, Jianjun Zhao, Yanyong Zhang

DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure

While speculative decoding has recently appeared as a promising direction for accelerating the inference of large language models (LLMs), the speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods…

Machine Learning · Computer Science 2024-10-16 Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou

Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time…

Computation and Language · Computer Science 2026-03-03 Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li

Global Resolution: Optimal Multi-Draft Speculative Sampling via Convex Minimization

Speculative sampling reduces the latency of autoregressive decoding for target model LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample…

Machine Learning · Computer Science 2025-11-21 Rahul Krishna Thomas, Arka Pal

Draft-OPD: On-Policy Distillation for Speculative Draft Models

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, like EAGLE3 or DFlash, is supervised…

Computation and Language · Computer Science 2026-05-29 Haodi Lei, Yafy Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng

Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement

Speculative decoding is an inference-acceleration method for large language models (LLMs) where a small language model generates a draft-token sequence which is further verified by the target LLM in parallel. Recent works have advanced this…

Machine Learning · Computer Science 2024-03-06 Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a bottleneck as vocabulary sizes scale. While existing static pruning methods effectively reduce this…

Computation and Language · Computer Science 2026-05-29 Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang

Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios

Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing…

Computation and Language · Computer Science 2025-11-26 Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu, Hai Zhao

STree: Speculative Tree Decoding for Hybrid State-Space Models

Speculative decoding is a technique to leverage hardware concurrency in order to enable multiple steps of token generation in a single forward pass, thus improving the efficiency of large-scale autoregressive (AR) Transformer models.…

Machine Learning · Computer Science 2025-10-29 Yangchao Wu, Zongyue Qin, Alex Wong, Stefano Soatto

Accelerating Speculative Decoding with Block Diffusion Draft Trees

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an…

Computation and Language · Computer Science 2026-04-15 Liran Ringel, Yaniv Romano

Automatic Task Detection and Heterogeneous LLM Speculative Decoding

Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate…

Computation and Language · Computer Science 2025-05-14 Danying Ge, Jianhua Gao, Qizhi Jiang, Yifei Feng, Weixing Ji