Related papers: Test-Time Speculation

Online Speculative Decoding

Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive…

Artificial Intelligence · Computer Science 2024-06-11 Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time…

Computation and Language · Computer Science 2026-03-03 Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li

TAPS: Task Aware Proposal Distributions for Speculative Sampling

Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad…

Computation and Language · Computer Science 2026-03-31 Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration

Inference acceleration of large language models (LLMs) has been put forward in many application scenarios and speculative decoding has shown its advantage in addressing inference acceleration. Speculative decoding usually introduces a draft…

Machine Learning · Computer Science 2024-12-03 Zhuofan Wen, Shangtong Gui, Yang Feng

Attention Drift: What Autoregressive Speculative Decoding Models Learn

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call…

Machine Learning · Computer Science 2026-05-12 Doğaç Eldenk, Payal Mohapatra, Yigitcan Comlek, Kaan Oktay, Hongyang Zhang, Stephen Xia

Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding

LLMs have low GPU efficiency and high latency due to autoregressive decoding. Speculative decoding (SD) mitigates this using a small draft model to speculatively generate multiple tokens, which are then verified in parallel by a target…

Computation and Language · Computer Science 2026-04-21 Sungkyun Kim, Jaemin Kim, Dogyung Yoon, Jiho Shin, Junyeol Lee, Jiwon Seo

Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding

Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in…

Machine Learning · Computer Science 2025-05-27 Yixuan Wang, Yijun Liu, Shiyu Ji, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che

POSS: Position Specialist Generates Better Draft for Speculative Decoding

Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the…

Computation and Language · Computer Science 2025-06-05 Langlin Huang, Chengsong Huang, Jixuan Leng, Di Huang, Jiaxin Huang

Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment

The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive…

Machine Learning · Computer Science 2025-02-03 Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, Jonas Kohler

Accelerating Production LLMs with Combined Token/Embedding Speculators

This technical report describes the design and training of novel speculative decoding draft models, for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both…

Computation and Language · Computer Science 2024-06-10 Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa

Draft-OPD: On-Policy Distillation for Speculative Draft Models

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, like EAGLE3 or DFlash, is supervised…

Computation and Language · Computer Science 2026-05-29 Haodi Lei, Yafy Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng

Decoding Speculative Decoding

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

Machine Learning · Computer Science 2025-02-06 Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman

Flatter Tokens are More Valuable for Speculative Draft Model Training

Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset. We approach this problem from a data-centric perspective, finding that…

Computation and Language · Computer Science 2026-02-19 Jiaming Fan, Daming Cao, Xiangzhong Luo, Jiale Fu, Chonghan Liu, Xu Yang

Fast Collaborative Inference via Distributed Speculative Decoding

Speculative decoding accelerates large language model (LLM) inference by allowing a small draft model to predict multiple future tokens for verification by a larger target model. In AI-native radio access networks (AI-RAN), this enables…

Signal Processing · Electrical Eng. & Systems 2026-01-13 Ce Zheng, Ke Zhang, Chen Sun, Wenqi Zhang, Qiong Liu, Angesom Ataklity Tesfay

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

Speculative decoding has become a widely adopted technique for accelerating large language model (LLM) inference by drafting multiple candidate tokens and verifying them with a target model in parallel. Its efficiency, however, critically…

Computation and Language · Computer Science 2026-05-19 Shuoyang Sun, Chang Dai, Hao Fang, Kuofeng Gao, Xinhao Zhong, Yi Sun, Fan Mo, Shu-Tao Xia, Bin Chen

Communication-Efficient Collaborative LLM Inference via Distributed Speculative Decoding

Speculative decoding is an emerging technique that accelerates large language model (LLM) inference by allowing a smaller draft model to predict multiple tokens in advance, which are then verified or corrected by a larger target model. In…

Signal Processing · Electrical Eng. & Systems 2025-11-10 Ce Zheng, Tingting Yang

LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is significantly…

Machine Learning · Computer Science 2026-03-02 Alexander Samarin, Sergei Krutikov, Anton Shevtsov, Sergei Skvortsov, Filipp Fisin, Alexander Golubev

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in…

Computation and Language · Computer Science 2024-11-12 Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun

Acceptance Dynamics Across Cognitive Domains in Speculative Decoding

Speculative decoding accelerates large language model (LLM) inference. It uses a small draft model to propose a tree of future tokens. A larger target model then verifies these tokens in a single batched forward pass. Despite the growing…

Artificial Intelligence · Computer Science 2026-04-17 Saif Mahmoud

Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement…

Machine Learning · Computer Science 2026-03-23 Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han