Related papers: Self Speculative Decoding for Diffusion Large Lang…

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes…

Computation and Language · Computer Science 2026-05-18 Shengyin Sun, Yiming Li, Renxi Liu, Xinqi Li, Hui-Ling Zhen, Weizhe Lin, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass.…

Computation and Language · Computer Science 2025-06-12 Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel

Tutorial Proposal: Speculative Decoding for Efficient LLM Inference

This tutorial presents a comprehensive introduction to Speculative Decoding (SD), an advanced technique for LLM inference acceleration that has garnered significant research interest in recent years. SD is introduced as an innovative…

Computation and Language · Computer Science 2025-03-04 Heming Xia, Cunxiao Du, Yongqi Li, Qian Liu, Wenjie Li

DFlash: Block Diffusion for Flash Speculative Decoding

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast…

Computation and Language · Computer Science 2026-05-29 Jian Chen, Yesheng Liang, Zhijian Liu

DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving

Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain…

Machine Learning · Computer Science 2025-12-02 Fengze Yu, Leshu Li, Brad McDanel, Sai Qian Zhang

Speculative Safety-Aware Decoding

Despite extensive efforts to align Large Language Models (LLMs) with human values and safety rules, jailbreak attacks that exploit certain vulnerabilities continuously emerge, highlighting the need to strengthen existing LLMs with…

Machine Learning · Computer Science 2025-09-30 Xuekang Wang, Shengyu Zhu, Xueqi Cheng

Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput

Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens that are later verified by a stronger target model. While effective in centralized systems, its behavior in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Jingwei Song, Wanyi Chen, Xinyuan Song, Max, Chris Tong, Gufeng Chen, Tianyi Zhao, Eric Yang, Bill Shi, Lynn Ai

S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models

Deployment of autoregressive large language models (LLMs) is costly, and as these models increase in size, the associated costs will become even more considerable. Consequently, different methods have been proposed to accelerate the token…

Computation and Language · Computer Science 2024-07-03 Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh

An Empirical Study of Speculative Decoding on Software Engineering Tasks

Large Language Models (LLMs) have become widely used for Software Engineering (SE) tasks, spanning from function-level code generation to complex repository-level workflows. However, the high latency of autoregressive inference remains a…

Software Engineering · Computer Science 2026-05-05 Yijia Li, Junkai Chen, Xing Hu, Xin Xia

Self-Speculative Biased Decoding for Faster Re-Translation

Large language models achieve strong machine translation quality but incur high inference cost and latency, posing challenges for simultaneous translation. Re-translation provides a practical solution for off-the-shelf LLMs by repeatedly…

Computation and Language · Computer Science 2026-01-06 Linxiao Zeng, Haoyun Deng, Kangyuan Shu, Shizhen Wang

Speculative Decoding: Performance or Illusion?

Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch…

Computation and Language · Computer Science 2026-03-19 Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a…

Artificial Intelligence · Computer Science 2025-03-17 Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun

Speculative Decoding Reimagined for Multimodal Large Language Models

This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy.…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Rongrong Ji

Speculative Speculative Decoding

Autoregressive decoding is bottlenecked by its sequential nature. Speculative decoding has become a standard way to accelerate inference by using a fast draft model to predict upcoming tokens from a slower target model, and then verifying…

Machine Learning · Computer Science 2026-05-06 Tanishq Kumar, Tri Dao, Avner May

SSSD: Simply-Scalable Speculative Decoding

Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achieve substantial…

Computation and Language · Computer Science 2026-01-08 Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Niklas Zwingenberger, Lorenz K. Müller, Lukas Cavigelli

KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization

Speculative Decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens using a compact model and…

Computation and Language · Computer Science 2026-01-21 Mingbo Song, Heming Xia, Jun Zhang, Chak Tou Leong, Qiancheng Xu, Wenjie Li, Sujian Li

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto

SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Speculative decoding (SD) has emerged as a widely used paradigm to accelerate LLM inference without compromising quality. It works by first employing a compact model to draft multiple tokens efficiently and then using the target LLM to…

Computation and Language · Computer Science 2025-03-07 Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding

Modern autoregressive speech synthesis models leveraging language models have demonstrated remarkable performance. However, the sequential nature of next token prediction in these models leads to significant latency, hindering their…

Sound · Computer Science 2025-06-04 Zijian Lin, Yang Zhang, Yougen Yuan, Yuming Yan, Jinjiang Liu, Zhiyong Wu, Pengfei Hu, Qun Yu