Related papers: Fast Large Language Model Collaborative Decoding v…

Speculate, then Collaborate: Fusing Knowledge of Language Models during Decoding

Large Language Models (LLMs) often excel in specific domains but fall short in others due to the limitations of their training. Thus, enabling LLMs to solve problems collaboratively by integrating their complementary knowledge promises to…

Computation and Language · Computer Science 2025-03-20 Ziyao Wang , Muneeza Azmat , Ang Li , Raya Horesh , Mikhail Yurochkin

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia , Zhe Yang , Qingxiu Dong , Peiyi Wang , Yongqi Li , Tao Ge , Tianyu Liu , Wenjie Li , Zhifang Sui

Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding

Large Language Models (LLMs) exhibit impressive capabilities across various applications but encounter substantial challenges such as high inference latency, considerable training costs, and the generation of hallucinations. Collaborative…

Computation and Language · Computer Science 2024-10-24 Kaiyan Zhang , Jianyu Wang , Ning Ding , Biqing Qi , Ermo Hua , Xingtai Lv , Bowen Zhou

Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science 2024-11-28 Hyun Ryu , Eric Kim

Decoding Speculative Decoding

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

Machine Learning · Computer Science 2025-02-06 Minghao Yan , Saurabh Agarwal , Shivaram Venkataraman

SPIN: Accelerating Large Language Model Inference with Heterogeneous Speculative Models

Speculative decoding has been shown as an effective way to accelerate Large Language Model (LLM) inference by using a Small Speculative Model (SSM) to generate candidate tokens in a so-called speculation phase, which are subsequently…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-21 Fahao Chen , Peng Li , Tom H. Luan , Zhou Su , Jing Deng

CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter

Speculative decoding is a powerful technique that accelerates Large Language Model (LLM) inference by leveraging a lightweight speculative draft model. However, existing designs suffers in performance due to misalignment between training…

Computation and Language · Computer Science 2025-05-27 Yepeng Weng , Dianwen Mei , Huishi Qiu , Xujie Chen , Li Liu , Jiang Tian , Zhongchao Shi

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The…

Computation and Language · Computer Science 2025-02-11 Jun Zhang , Jue Wang , Huan Li , Lidan Shou , Ke Chen , Gang Chen , Sharad Mehrotra

Chain-of-Specificity: An Iteratively Refining Method for Eliciting Knowledge from Large Language Models

Large Language Models (LLMs) exhibit remarkable generative capabilities, enabling the generation of valuable information. Despite these advancements, previous research found that LLMs sometimes struggle with adhering to specific constraints…

Artificial Intelligence · Computer Science 2024-02-27 Kaiwen Wei , Jingyuan Zhang , Hongzhi Zhang , Fuzheng Zhang , Di Zhang , Li Jin , Yue Yu

Communication-Efficient Collaborative LLM Inference via Distributed Speculative Decoding

Speculative decoding is an emerging technique that accelerates large language model (LLM) inference by allowing a smaller draft model to predict multiple tokens in advance, which are then verified or corrected by a larger target model. In…

Signal Processing · Electrical Eng. & Systems 2025-11-10 Ce Zheng , Tingting Yang

Collaborative Speculative Inference for Efficient LLM Inference Serving

Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language model (LLM). This approach enhances the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-16 Luyao Gao , Jianchun Liu , Hongli Xu , Xichong Zhang , Yunming Liao , Liusheng Huang

Learning to Decode Collaboratively with Multiple Language Models

We propose a method to teach multiple large language models (LLM) to collaborate by interleaving their generations at the token level. We model the decision of which LLM generates the next token as a latent variable. By optimizing the…

Computation and Language · Computer Science 2024-08-28 Shannon Zejiang Shen , Hunter Lang , Bailin Wang , Yoon Kim , David Sontag

Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling

Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently…

Computation and Language · Computer Science 2025-09-08 Shengyin Sun , Yiming Li , Xing Li , Yingzhao Lian , Weizhe Lin , Hui-Ling Zhen , Zhiyuan Yang , Chen Chen , Xianzhi Yu , Mingxuan Yuan , Chen Ma

MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less…

Machine Learning · Computer Science 2026-02-17 Zongle Huang , Lei Zhu , Zongyuan Zhan , Ting Hu , Weikai Mao , Xianzhi Yu , Yongpan Liu , Tianyu Zhang

Online Speculative Decoding

Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive…

Artificial Intelligence · Computer Science 2024-06-11 Xiaoxuan Liu , Lanxiang Hu , Peter Bailis , Alvin Cheung , Zhijie Deng , Ion Stoica , Hao Zhang

Semantic uncertainty in advanced decoding methods for LLM generation

This study investigates semantic uncertainty in large language model (LLM) outputs across different decoding methods, focusing on emerging techniques like speculative sampling and chain-of-thought (CoT) decoding. Through experiments on…

Computation and Language · Computer Science 2025-06-24 Darius Foodeei , Simin Fan , Martin Jaggi

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass.…

Computation and Language · Computer Science 2025-06-12 Nadav Timor , Jonathan Mamou , Daniel Korat , Moshe Berchansky , Gaurav Jain , Oren Pereg , Moshe Wasserblat , David Harel

SpecVLM: Fast Speculative Decoding in Vision-Language Models

Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Haiduo Huang , Fuwei Yang , Zhenhua Liu , Xuanwu Yin , Dong Li , Pengju Ren , Emad Barsoum

Speculative Decoding for Verilog: Speed and Quality, All in One

The rapid advancement of large language models (LLMs) has revolutionized code generation tasks across various programming languages. However, the unique characteristics of programming languages, particularly those like Verilog with specific…

Machine Learning · Computer Science 2025-03-19 Changran Xu , Yi Liu , Yunhao Zhou , Shan Huang , Ningyi Xu , Qiang Xu

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher , Brian R Bartoldson , Tal Ben-Nun , Michael Cardei , Bhavya Kailkhura , Ferdinando Fioretto