Related papers: Clover: Regressive Lightweight Speculative Decodin…

Clover-2: Accurate Inference for Regressive Lightweight Speculative Decoding

Large Language Models (LLMs) frequently suffer from inefficiencies, largely attributable to the discord between the requirements of auto-regressive decoding and the architecture of contemporary GPUs. Recently, regressive lightweight…

Computation and Language · Computer Science · 2024-08-02 · Bin Xiao, Lujun Gui, Lei Su, Weipeng Chen

The Synergy of Speculative Decoding and Batching in Serving Large Language Models

Large Language Models (LLMs) like GPT are state-of-the-art text generation models that provide significant assistance in daily routines. However, LLM execution is inherently sequential, since they only produce one token at a time, thus…

Machine Learning · Computer Science · 2023-10-31 · Qidong Su, Christina Giannoula, Gennady Pekhimenko

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science · 2024-10-10 · Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement

Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding. Recently, parallel decoding has shown significant promise in enhancing inference efficiency. However, we have…

Computation and Language · Computer Science · 2024-10-18 · Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang

AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential…

Computation and Language · Computer Science · 2025-06-05 · Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng

Beyond the Speculative Game: A Survey of Speculative Execution in Large Language Models

With the increasingly giant scales of (causal) large language models (LLMs), the inference efficiency comes as one of the core concerns along the improved performance. In contrast to the memory footprint, the latency bottleneck seems to be…

Computation and Language · Computer Science · 2024-04-24 · Chen Zhang, Zhuorui Liu, Dawei Song

Accelerating LLM Inference with Staged Speculative Decoding

Recent advances with large language models (LLM) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low…

Artificial Intelligence · Computer Science · 2023-08-10 · Benjamin Spector, Chris Re

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science · 2024-04-19 · Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload

Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-25 · Ziyue Liu, Zhengyang Wang, Ruijie Zhang, Avinash Maurya, Hui Zhou, Paul Hovland, Sheng Di, Franck Cappello, Bogdan Nicolae, Zheng Zhang

SpecMemo: Speculative Decoding is in Your Pocket

Recent advancements in speculative decoding have demonstrated considerable speedup across a wide array of large language model (LLM) tasks. Speculative decoding inherently relies on sacrificing extra memory allocations to generate several…

Machine Learning · Computer Science · 2025-06-04 · Selin Yildirim, Deming Chen

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science · 2024-06-05 · Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios

Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing…

Computation and Language · Computer Science · 2025-11-26 · Luohe Shi, Zuchao Li, Lefei Zhang, Baoyuan Qi, Guoming Liu, Hai Zhao

SPEED: Speculative Pipelined Execution for Efficient Decoding

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science · 2024-01-04 · Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science · 2024-11-28 · Hyun Ryu, Eric Kim

Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving

Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-03-04 · Rui Li, Zhaoning Zhang, Libo Zhang, Huaimin Wang, Xiang Fu, Zhiquan Lai

Lever: Speculative LLM Inference on Smartphones

Large language models (LLMs) are increasingly needed for interactive mobile applications, but high-quality models exceed the limited DRAM available on smartphones. Flash storage can hold larger models, yet flash-backed inference is slow…

Machine Learning · Computer Science · 2026-05-19 · Tuowei Wang, Fengzu Li, Yanfan Sun, Wei Gao, Ju Ren

SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but the limited GPU memory (VRAM) resources struggle to accommodate the linearly growing demand for key-value (KV) cache as the…

Computation and Language · Computer Science · 2025-03-21 · Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han

Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States

Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of…

Computation and Language · Computer Science · 2026-02-05 · Ximing Dong, Shaowei Wang, Dayi Lin, Boyuan Chen, Ahmed E. Hassan

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Autoregressive decoding of large language models (LLMs) is memory bandwidth bounded, resulting in high latency and significant wastes of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding…

Machine Learning · Computer Science · 2024-02-06 · Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang

When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited…

Computation and Language · Computer Science · 2024-07-26 · Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan Celine Lin