English
Related papers

Related papers: Fast Large Language Model Collaborative Decoding v…

200 papers

Large Language Models (LLMs) often excel in specific domains but fall short in others due to the limitations of their training. Thus, enabling LLMs to solve problems collaboratively by integrating their complementary knowledge promises to…

Computation and Language · Computer Science 2025-03-20 Ziyao Wang , Muneeza Azmat , Ang Li , Raya Horesh , Mikhail Yurochkin

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia , Zhe Yang , Qingxiu Dong , Peiyi Wang , Yongqi Li , Tao Ge , Tianyu Liu , Wenjie Li , Zhifang Sui

Large Language Models (LLMs) exhibit impressive capabilities across various applications but encounter substantial challenges such as high inference latency, considerable training costs, and the generation of hallucinations. Collaborative…

Computation and Language · Computer Science 2024-10-24 Kaiyan Zhang , Jianyu Wang , Ning Ding , Biqing Qi , Ermo Hua , Xingtai Lv , Bowen Zhou

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science 2024-11-28 Hyun Ryu , Eric Kim

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

Machine Learning · Computer Science 2025-02-06 Minghao Yan , Saurabh Agarwal , Shivaram Venkataraman

Speculative decoding has been shown as an effective way to accelerate Large Language Model (LLM) inference by using a Small Speculative Model (SSM) to generate candidate tokens in a so-called speculation phase, which are subsequently…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-21 Fahao Chen , Peng Li , Tom H. Luan , Zhou Su , Jing Deng

Speculative decoding is a powerful technique that accelerates Large Language Model (LLM) inference by leveraging a lightweight speculative draft model. However, existing designs suffers in performance due to misalignment between training…

Computation and Language · Computer Science 2025-05-27 Yepeng Weng , Dianwen Mei , Huishi Qiu , Xujie Chen , Li Liu , Jiang Tian , Zhongchao Shi

We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The…

Computation and Language · Computer Science 2025-02-11 Jun Zhang , Jue Wang , Huan Li , Lidan Shou , Ke Chen , Gang Chen , Sharad Mehrotra

Large Language Models (LLMs) exhibit remarkable generative capabilities, enabling the generation of valuable information. Despite these advancements, previous research found that LLMs sometimes struggle with adhering to specific constraints…

Artificial Intelligence · Computer Science 2024-02-27 Kaiwen Wei , Jingyuan Zhang , Hongzhi Zhang , Fuzheng Zhang , Di Zhang , Li Jin , Yue Yu

Speculative decoding is an emerging technique that accelerates large language model (LLM) inference by allowing a smaller draft model to predict multiple tokens in advance, which are then verified or corrected by a larger target model. In…

Signal Processing · Electrical Eng. & Systems 2025-11-10 Ce Zheng , Tingting Yang

Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language model (LLM). This approach enhances the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-16 Luyao Gao , Jianchun Liu , Hongli Xu , Xichong Zhang , Yunming Liao , Liusheng Huang

We propose a method to teach multiple large language models (LLM) to collaborate by interleaving their generations at the token level. We model the decision of which LLM generates the next token as a latent variable. By optimizing the…

Computation and Language · Computer Science 2024-08-28 Shannon Zejiang Shen , Hunter Lang , Bailin Wang , Yoon Kim , David Sontag

Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently…

Computation and Language · Computer Science 2025-09-08 Shengyin Sun , Yiming Li , Xing Li , Yingzhao Lian , Weizhe Lin , Hui-Ling Zhen , Zhiyuan Yang , Chen Chen , Xianzhi Yu , Mingxuan Yuan , Chen Ma

Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less…

Machine Learning · Computer Science 2026-02-17 Zongle Huang , Lei Zhu , Zongyuan Zhan , Ting Hu , Weikai Mao , Xianzhi Yu , Yongpan Liu , Tianyu Zhang

Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model's outputs. However, its efficacy can be limited due to the low predictive…

Artificial Intelligence · Computer Science 2024-06-11 Xiaoxuan Liu , Lanxiang Hu , Peter Bailis , Alvin Cheung , Zhijie Deng , Ion Stoica , Hao Zhang

This study investigates semantic uncertainty in large language model (LLM) outputs across different decoding methods, focusing on emerging techniques like speculative sampling and chain-of-thought (CoT) decoding. Through experiments on…

Computation and Language · Computer Science 2025-06-24 Darius Foodeei , Simin Fan , Martin Jaggi

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass.…

Computation and Language · Computer Science 2025-06-12 Nadav Timor , Jonathan Mamou , Daniel Korat , Moshe Berchansky , Gaurav Jain , Oren Pereg , Moshe Wasserblat , David Harel

Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Haiduo Huang , Fuwei Yang , Zhenhua Liu , Xuanwu Yin , Dong Li , Pengju Ren , Emad Barsoum

The rapid advancement of large language models (LLMs) has revolutionized code generation tasks across various programming languages. However, the unique characteristics of programming languages, particularly those like Verilog with specific…

Machine Learning · Computer Science 2025-03-19 Changran Xu , Yi Liu , Yunhao Zhou , Shan Huang , Ningyi Xu , Qiang Xu

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher , Brian R Bartoldson , Tal Ben-Nun , Michael Cardei , Bhavya Kailkhura , Ferdinando Fioretto
‹ Prev 1 2 3 10 Next ›