Related papers: Learning Harmonized Representations for Speculativ…

HADES: Hardware Accelerated Decoding for Efficient Speculation in Large Language Models

Large Language Models (LLMs) have revolutionized natural language processing by understanding and generating human-like text. However, the increasing demand for more sophisticated LLMs presents significant computational challenges due to…

Computation and Language · Computer Science 2025-01-14 Ze Yang, Yihong Jin, Xinhe Xu

CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration

Auto-regressive decoding in Large Language Models (LLMs) is inherently memory-bound: every generation step requires loading the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers), making…

Machine Learning · Computer Science 2026-05-13 Yuning Han, Yangchenchen Jin, Dylan Zhao, Jingwei Sun

HaS: Accelerating RAG through Homology-Aware Speculative Retrieval

Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as the knowledge databases…

Information Retrieval · Computer Science 2026-04-23 Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu

Fast Inference via Hierarchical Speculative Decoding

Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft…

Machine Learning · Computer Science 2025-10-24 Clara Mohri, Haim Kaplan, Tal Schuster, Yishay Mansour, Amir Globerson

Automatic Task Detection and Heterogeneous LLM Speculative Decoding

Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate…

Computation and Language · Computer Science 2025-05-14 Danying Ge, Jianhua Gao, Qizhi Jiang, Yifei Feng, Weixing Ji

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding

Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions, as they have been widely incorporated into real-world services. Speculative decoding, a fully algorithmic solution, has gained attention for…

Computation and Language · Computer Science 2025-02-11 Sukmin Cho, Sangjin Choi, Taeho Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon

HSD: Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding

Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Wenhui Liao, Hongliang Li, Pengyu Xie, Xinyu Cai, Yufan Shen, Yi Xin, Qi Qin, Shenglong Ye, Tianbin Li, Ming Hu, Junjun He, Yihao Liu, Wenhai Wang, Min Dou, Bin Fu, Botian Shi, Yu Qiao, Lianwen Jin

Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification

Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the…

Computation and Language · Computer Science 2025-09-15 Jikai Wang, Zhenxu Tian, Juntao Li, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang

When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding

Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting…

Computation and Language · Computer Science 2025-11-04 Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang

SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding

Large language model (LLM)-based automatic speech recognition (ASR) has recently attracted a lot of attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs challenges the…

Audio and Speech Processing · Electrical Eng. & Systems 2025-07-29 Linye Wei, Shuzhang Zhong, Songqiang Xu, Runsheng Wang, Ru Huang, Meng Li

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

Fast Large Language Model Collaborative Decoding via Speculation

Large Language Model (LLM) collaborative decoding techniques improve output quality by combining the outputs of multiple models at each generation step, but they incur high computational costs. In this paper, we introduce Collaborative…

Computation and Language · Computer Science 2025-05-30 Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang

Accelerating LLM Inference with Staged Speculative Decoding

Recent advances with large language models (LLM) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low…

Artificial Intelligence · Computer Science 2023-08-10 Benjamin Spector, Chris Re

HiSpec: Hierarchical Speculative Decoding for LLMs

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token generation when a…

Computation and Language · Computer Science 2026-05-27 Avinash Kumar, Sujay Sanghavi, Poulami Das

Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding

Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to…

Artificial Intelligence · Computer Science 2026-03-03 Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, Jianwei Zhang, Junyang Lin, Zhi-Qi Cheng

Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in…

Computation and Language · Computer Science 2025-12-18 Chendong Sun, Ali Mao, Lei Xu, mingmin Chen

SPIN: Accelerating Large Language Model Inference with Heterogeneous Speculative Models

Speculative decoding has been shown as an effective way to accelerate Large Language Model (LLM) inference by using a Small Speculative Model (SSM) to generate candidate tokens in a so-called speculation phase, which are subsequently…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-21 Fahao Chen, Peng Li, Tom H. Luan, Zhou Su, Jing Deng

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass.…

Computation and Language · Computer Science 2025-06-12 Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel

Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment

The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive…

Machine Learning · Computer Science 2025-02-03 Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, Jonas Kohler

Decoding Speculative Decoding

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

Machine Learning · Computer Science 2025-02-06 Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman