Related papers: Parallel Prefix Verification for Speculative Gener…

PEARL: Parallel Speculative Decoding with Adaptive Draft Length

Speculative decoding (SD), where an extra draft model is employed to provide multiple draft tokens first, and then the original target model verifies these tokens in parallel, has shown great power for LLM inference acceleration. However,…

Computation and Language · Computer Science 2025-02-18 Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, Xiao Sun

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto

SPEED: Speculative Pipelined Execution for Efficient Decoding

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

Accelerate Speculative Decoding with Sparse Computation in Verification

Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel. However, the verification stage often becomes the dominant computational bottleneck, especially for long-context inputs…

Computation and Language · Computer Science 2025-12-29 Jikai Wang, Jianchao Tan, Yuxuan Hu, Jiayu Qin, Yerui Sun, Yuchen Xie, Xunliang Cai, Juntao Li, Min Zhang

PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation

The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a…

Machine Learning · Computer Science 2025-12-02 Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, Emad Barsoum

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

Speculative decoding accelerates Large Language Models (LLMs) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are…

Computation and Language · Computer Science 2026-05-12 Zihao An, Taichi Liu, Ziqiong Liu, Dong Li, Ruofeng Liu, Emad Barsoum

PaSS: Parallel Speculative Sampling

Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and…

Computation and Language · Computer Science 2023-11-23 Giovanni Monea, Armand Joulin, Edouard Grave

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose Smart Parallel Auto-Correct dEcoding (SPACE), an innovative approach…

Computation and Language · Computer Science 2024-05-21 Hanling Yi, Feng Lin, Hongbin Li, Peiyang Ning, Xiaotian Yu, Rong Xiao

Speeding up Speculative Decoding via Sequential Approximate Verification

Speculative Decoding (SD) is a recently proposed technique for faster inference using Large Language Models (LLMs). SD operates by using a smaller draft LLM for autoregressively generating a sequence of tokens and a larger target LLM for…

Machine Learning · Computer Science 2025-07-09 Meiyu Zhong, Noel Teku, Ravi Tandon

Traversal Verification for Speculative Tree Decoding

Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in…

Computation and Language · Computer Science 2025-11-06 Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi

Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States

Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of…

Computation and Language · Computer Science 2026-02-05 Ximing Dong, Shaowei Wang, Dayi Lin, Boyuan Chen, Ahmed E. Hassan

Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks

Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable…

Computation and Language · Computer Science 2025-12-15 Sergey Pankratov, Dan Alistarh

CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference

Speculative decoding (SD), where a draft model provides multiple candidate tokens for the target model to verify in parallel, has demonstrated significant potential for accelerating LLM inference. Yet, existing SD approaches adhere to a…

Machine Learning · Computer Science 2025-09-22 Enyu Zhou, Kai Sheng, Hao Chen, Xin He

SPIN: Accelerating Large Language Model Inference with Heterogeneous Speculative Models

Speculative decoding has been shown as an effective way to accelerate Large Language Model (LLM) inference by using a Small Speculative Model (SSM) to generate candidate tokens in a so-called speculation phase, which are subsequently…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-21 Fahao Chen, Peng Li, Tom H. Luan, Zhou Su, Jing Deng

Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding

Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in…

Machine Learning · Computer Science 2025-05-27 Yixuan Wang, Yijun Liu, Shiyu Ji, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che

PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding

Speculative decoding accelerates large language model inference by using smaller draft models to generate candidate tokens for parallel verification. However, current approaches are limited by sequential stage dependencies that prevent full…

Artificial Intelligence · Computer Science 2025-05-06 Bradley McDanel, Sai Qian Zhang, Yunhai Hu, Zining Liu

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes…

Computation and Language · Computer Science 2026-05-18 Shengyin Sun, Yiming Li, Renxi Liu, Xinqi Li, Hui-Ling Zhen, Weizhe Lin, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma

Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

Retrieval augmented generation (RAG) combines the generative abilities of large language models (LLMs) with external knowledge sources to provide more accurate and up-to-date responses. Recent RAG advancements focus on improving retrieval…

Computation and Language · Computer Science 2025-03-03 Zilong Wang, Zifeng Wang, Long Le, Huaixiu Steven Zheng, Swaroop Mishra, Vincent Perot, Yuwei Zhang, Anush Mattapalli, Ankur Taly, Jingbo Shang, Chen-Yu Lee, Tomas Pfister

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

LLM serving platforms are increasingly deployed as multi-model cloud systems, where user demand is often long-tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-13 Jincheng Xie, Yawen Ling, Qi Xiao, Feiyu Zhang, Zhongyi Huang, Wen Hu, Yu Zheng