Related papers: Online Speculative Decoding

Decoding Speculative Decoding

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

Machine Learning · Computer Science 2025-02-06 Minghao Yan , Saurabh Agarwal , Shivaram Venkataraman

When Drafts Evolve: Speculative Decoding Meets Online Learning

Speculative decoding has emerged as a widely adopted paradigm for accelerating large language model inference, where a lightweight draft model rapidly generates candidate tokens that are then verified in parallel by a larger target model.…

Machine Learning · Computer Science 2026-03-16 Yu-Yang Qian , Hao-Cong Wu , Yichao Fu , Hao Zhang , Peng Zhao

Speculative Streaming: Fast LLM Inference without Auxiliary Models

Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both…

Computation and Language · Computer Science 2024-02-20 Nikhil Bhendawade , Irina Belousova , Qichen Fu , Henry Mason , Mohammad Rastegari , Mahyar Najibi

Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science 2024-11-28 Hyun Ryu , Eric Kim

Automatic Task Detection and Heterogeneous LLM Speculative Decoding

Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate…

Computation and Language · Computer Science 2025-05-14 Danying Ge , Jianhua Gao , Qizhi Jiang , Yifei Feng , Weixing Ji

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration

Inference acceleration of large language models (LLMs) has been put forward in many application scenarios and speculative decoding has shown its advantage in addressing inference acceleration. Speculative decoding usually introduces a draft…

Machine Learning · Computer Science 2024-12-03 Zhuofan Wen , Shangtong Gui , Yang Feng

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia , Zhe Yang , Qingxiu Dong , Peiyi Wang , Yongqi Li , Tao Ge , Tianyu Liu , Wenjie Li , Zhifang Sui

Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs

Speculative decoding is widely used in accelerating large language model (LLM) inference. In this work, we focus on the online draft model selection problem in speculative decoding. We design an algorithm that provably competes with the…

Machine Learning · Computer Science 2026-04-24 Hongyi Liu , Jiaji Huang , Zhen Jia , Youngsuk Park , Yu-Xiang Wang

Training Domain Draft Models for Speculative Decoding: Best Practices and Insights

Speculative decoding is an effective method for accelerating inference of large language models (LLMs) by employing a small draft model to predict the output of a target model. However, when adapting speculative decoding to domain-specific…

Computation and Language · Computer Science 2025-03-27 Fenglu Hong , Ravi Raju , Jonathan Lingjie Li , Bo Li , Urmish Thakker , Avinash Ravichandran , Swayambhoo Jain , Changran Hu

Confidence-Modulated Speculative Decoding for Large Language Models

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid…

Computation and Language · Computer Science 2026-05-29 Jaydip Sen , Subhasis Dasgupta , Hetvi Waghela

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in…

Computation and Language · Computer Science 2024-11-12 Euiin Yi , Taehyeon Kim , Hongseok Jeung , Du-Seong Chang , Se-Young Yun

An Interpretable Latency Model for Speculative Decoding in LLM Serving

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups…

Machine Learning · Computer Science 2026-05-15 Linghao Kong , Megan Flynn , Michael Peng , Nir Shavit , Mark Kurtz , Alexandre Marques

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher , Brian R Bartoldson , Tal Ben-Nun , Michael Cardei , Bhavya Kailkhura , Ferdinando Fioretto

Speculative Decoding Reimagined for Multimodal Large Language Models

This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy.…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Luxi Lin , Zhihang Lin , Zhanpeng Zeng , Rongrong Ji

Speculative Decoding with a Speculative Vocabulary

Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. This relies upon a small draft model, tasked with predicting…

Computation and Language · Computer Science 2026-02-17 Miles Williams , Young D. Kwon , Rui Li , Alexandros Kouris , Stylianos I. Venieris

Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput

Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens that are later verified by a stronger target model. While effective in centralized systems, its behavior in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Jingwei Song , Wanyi Chen , Xinyuan Song , Max , Chris Tong , Gufeng Chen , Tianyi Zhao , Eric Yang , Bill Shi , Lynn Ai

3-Model Speculative Decoding

Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a…

Computation and Language · Computer Science 2025-10-16 Sanghyun Byun , Mohanad Odema , Jung Ick Guack , Baisub Lee , Jacob Song , Woo Seong Chung

Speculative Decoding: Performance or Illusion?

Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch…

Computation and Language · Computer Science 2026-03-19 Xiaoxuan Liu , Jiaxiang Yu , Jongseok Park , Ion Stoica , Alvin Cheung

Communication-Efficient Collaborative LLM Inference via Distributed Speculative Decoding

Speculative decoding is an emerging technique that accelerates large language model (LLM) inference by allowing a smaller draft model to predict multiple tokens in advance, which are then verified or corrected by a larger target model. In…

Signal Processing · Electrical Eng. & Systems 2025-11-10 Ce Zheng , Tingting Yang

Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding

LLMs have low GPU efficiency and high latency due to autoregressive decoding. Speculative decoding (SD) mitigates this using a small draft model to speculatively generate multiple tokens, which are then verified in parallel by a target…

Computation and Language · Computer Science 2026-04-21 Sungkyun Kim , Jaemin Kim , Dogyung Yoon , Jiho Shin , Junyeol Lee , Jiwon Seo