Related papers: AMUSD: Asynchronous Multi-Device Speculative Decod…

DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising…

Computation and Language · Computer Science 2025-03-04 Kai Lv , Honglin Guo , Qipeng Guo , Xipeng Qiu

AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft…

Computation and Language · Computer Science 2026-05-27 Kuan-Wei Lu , Ding-Yong Hong , Pangfeng Liu , Jan-Jan Wu

Speculative Decoding Reimagined for Multimodal Large Language Models

This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy.…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Luxi Lin , Zhihang Lin , Zhanpeng Zeng , Rongrong Ji

Improving Multi-candidate Speculative Decoding

Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs) by using a lower complexity draft model to propose candidate tokens verified by a larger target model. To further improve efficiency,…

Computation and Language · Computer Science 2024-12-17 Xiaofan Lu , Yixiao Zeng , Feiyang Ma , Zixu Yu , Marco Levorato

AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential…

Computation and Language · Computer Science 2025-06-05 Zhepei Wei , Wei-Lin Chen , Xinyu Zhu , Yu Meng

SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding

Large Language Models (LLMs) have become essential in advancing natural language processing (NLP) tasks, but their sequential token generation limits inference speed. Multi-Draft Speculative Decoding (MDSD) offers a promising solution by…

Computation and Language · Computer Science 2024-11-11 Ryan Sun , Tianyi Zhou , Xun Chen , Lichao Sun

Adaptive Draft-Verification for Efficient Large Language Model Decoding

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires…

Computation and Language · Computer Science 2024-08-20 Xukun Liu , Bowen Lei , Ruqi Zhang , Dongkuan Xu

3-Model Speculative Decoding

Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a…

Computation and Language · Computer Science 2025-10-16 Sanghyun Byun , Mohanad Odema , Jung Ick Guack , Baisub Lee , Jacob Song , Woo Seong Chung

RASD: Retrieval-Augmented Speculative Decoding

Speculative decoding accelerates inference in large language models (LLMs) by generating draft tokens for target model verification. Current approaches for obtaining draft tokens rely on lightweight draft models or additional model…

Computation and Language · Computer Science 2025-03-06 Guofeng Quan , Wenfeng Feng , Chuzhan Hao , Guochao Jiang , Yuewei Zhang , Hao Wang

SPEED: Speculative Pipelined Execution for Efficient Decoding

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper , Sehoon Kim , Hiva Mohammadzadeh , Hasan Genc , Kurt Keutzer , Amir Gholami , Sophia Shao

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The…

Computation and Language · Computer Science 2025-02-11 Jun Zhang , Jue Wang , Huan Li , Lidan Shou , Ke Chen , Gang Chen , Sharad Mehrotra

Make Every Draft Count: Hidden State based Speculative Decoding

Speculative decoding has emerged as a pivotal technique to accelerate LLM inference by employing a lightweight draft model to generate candidate tokens that are subsequently verified by the target model in parallel. However, while this…

Computation and Language · Computer Science 2026-02-26 Yuetao Chen , Xuliang Wang , Xinzhou Zheng , Ming Li , Peng Wang , Hong Xu

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia , Zhe Yang , Qingxiu Dong , Peiyi Wang , Yongqi Li , Tao Ge , Tianyu Liu , Wenjie Li , Zhifang Sui

Graph-Structured Speculative Decoding

Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models (LLMs) by employing a small language model to draft a hypothesis sequence, which is then validated by the LLM. The effectiveness…

Computation and Language · Computer Science 2024-07-24 Zhuocheng Gong , Jiahao Liu , Ziyue Wang , Pengfei Wu , Jingang Wang , Xunliang Cai , Dongyan Zhao , Rui Yan

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao , Hongming Zhang , Tao Ge , Siru Ouyang , Vicente Ordonez , Dong Yu

Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement

Speculative decoding is an inference-acceleration method for large language models (LLMs) where a small language model generates a draft-token sequence which is further verified by the target LLM in parallel. Recent works have advanced this…

Machine Learning · Computer Science 2024-03-06 Wonseok Jeon , Mukul Gagrani , Raghavv Goel , Junyoung Park , Mingu Lee , Christopher Lott

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a…

Artificial Intelligence · Computer Science 2025-03-17 Zongyue Qin , Zifan He , Neha Prakriya , Jason Cong , Yizhou Sun

TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs

Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental…

Computation and Language · Computer Science 2026-05-05 Sibo Xiao , Jinyuan Fu , Zhongle Xie , Lidan Shou

Accelerating Diffusion LLMs via Adaptive Parallel Decoding

The generation speed of LLMs are bottlenecked by autoregressive decoding, where tokens are predicted sequentially one by one. Alternatively, diffusion large language models (dLLMs) theoretically allow for parallel token generation, but in…

Computation and Language · Computer Science 2025-11-03 Daniel Israel , Guy Van den Broeck , Aditya Grover

SpecFLASH: A Latent-Guided Semi-autoregressive Speculative Decoding Framework for Efficient Multimodal Generation

Large language models and large multimodal models (LLMs and LMMs) deliver strong generative performance but suffer from slow decoding, a problem that becomes more severe when handling visual inputs, whose sequences typically contain many…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Zihua Wang , Ruibo Li , Haozhe Du , Joey Tianyi Zhou , Yu Zhang , Xu Yang