Related papers: Multi-Candidate Speculative Decoding

Improving Multi-candidate Speculative Decoding

Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs) by using a lower complexity draft model to propose candidate tokens verified by a larger target model. To further improve efficiency,…

Computation and Language · Computer Science 2024-12-17 Xiaofan Lu, Yixiao Zeng, Feiyang Ma, Zixu Yu, Marco Levorato

Confidence-Modulated Speculative Decoding for Large Language Models

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid…

Computation and Language · Computer Science 2026-05-29 Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration

Inference acceleration of large language models (LLMs) has been put forward in many application scenarios and speculative decoding has shown its advantage in addressing inference acceleration. Speculative decoding usually introduces a draft…

Machine Learning · Computer Science 2024-12-03 Zhuofan Wen, Shangtong Gui, Yang Feng

Online Speculative Decoding

Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive…

Artificial Intelligence · Computer Science 2024-06-11 Xiaoxuan Liu, Lanxiang Hu, Peter Bailis, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in…

Computation and Language · Computer Science 2024-11-12 Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun

Traversal Verification for Speculative Tree Decoding

Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in…

Computation and Language · Computer Science 2025-11-06 Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi

Speculative Decoding for Multi-Sample Inference

We propose a novel speculative decoding method tailored for multi-sample reasoning scenarios, such as self-consistency and Best-of-N sampling. Our method exploits the intrinsic consensus of parallel generation paths to synthesize…

Computation and Language · Computer Science 2025-03-10 Yiwei Li, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yueqi Zhang, Ji Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

Accelerating Large Language Model Decoding with Speculative Sampling

We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of…

Computation and Language · Computer Science 2023-02-03 Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper

S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models

Deployment of autoregressive large language models (LLMs) is costly, and as these models increase in size, the associated costs will become even more considerable. Consequently, different methods have been proposed to accelerate the token…

Computation and Language · Computer Science 2024-07-03 Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh

Accelerating Production LLMs with Combined Token/Embedding Speculators

This technical report describes the design and training of novel speculative decoding draft models, for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both…

Computation and Language · Computer Science 2024-06-10 Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa

Automatic Task Detection and Heterogeneous LLM Speculative Decoding

Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate…

Computation and Language · Computer Science 2025-05-14 Danying Ge, Jianhua Gao, Qizhi Jiang, Yifei Feng, Weixing Ji

Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment

The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive…

Machine Learning · Computer Science 2025-02-03 Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, Jonas Kohler

Decoding Speculative Decoding

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

Machine Learning · Computer Science 2025-02-06 Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman

Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding

Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in…

Machine Learning · Computer Science 2025-05-27 Yixuan Wang, Yijun Liu, Shiyu Ji, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto

Multi-Drafter Speculative Decoding with Alignment Feedback

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller model to draft future tokens, which are then verified by the target LLM. This preserves generation quality by accepting only aligned tokens.…

Computation and Language · Computer Science 2026-04-08 Taehyeon Kim, Hojung Jung, Se-Young Yun

When Drafts Evolve: Speculative Decoding Meets Online Learning

Speculative decoding has emerged as a widely adopted paradigm for accelerating large language model inference, where a lightweight draft model rapidly generates candidate tokens that are then verified in parallel by a larger target model.…

Machine Learning · Computer Science 2026-03-16 Yu-Yang Qian, Hao-Cong Wu, Yichao Fu, Hao Zhang, Peng Zhao

Multi-Draft Speculative Sampling: Canonical Decomposition and Theoretical Limits

We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an…

Computation and Language · Computer Science 2025-05-12 Ashish Khisti, M. Reza Ebrahimi, Hassan Dbouk, Arash Behboodi, Roland Memisevic, Christos Louizos

Not-a-Bandit: Provably No-Regret Drafter Selection in Speculative Decoding for LLMs

Speculative decoding is widely used in accelerating large language model (LLM) inference. In this work, we focus on the online draft model selection problem in speculative decoding. We design an algorithm that provably competes with the…

Machine Learning · Computer Science 2026-04-24 Hongyi Liu, Jiaji Huang, Zhen Jia, Youngsuk Park, Yu-Xiang Wang

Make Every Draft Count: Hidden State based Speculative Decoding

Speculative decoding has emerged as a pivotal technique to accelerate LLM inference by employing a lightweight draft model to generate candidate tokens that are subsequently verified by the target model in parallel. However, while this…

Computation and Language · Computer Science 2026-02-26 Yuetao Chen, Xuliang Wang, Xinzhou Zheng, Ming Li, Peng Wang, Hong Xu