Related papers: C2T: A Classifier-Based Tree Construction Method i…

OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure

Autoregressive language models demonstrate excellent performance in various scenarios. However, the inference efficiency is limited by its one-step-one-word generation mode, which has become a pressing problem recently as the models become…

Computation and Language · Computer Science · 2025-04-25 · Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Min Zhang

Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models

Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size. To address this, speculative decoding emerges as a solution, enabling the simultaneous generation and…

Computation and Language · Computer Science · 2026-02-27 · Yinrong Hong, Zhiquan Tan, Kai Hu

Traversal Verification for Speculative Tree Decoding

Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in…

Computation and Language · Computer Science · 2025-11-06 · Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi

Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time…

Computation and Language · Computer Science · 2026-03-03 · Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li

Accelerating LLM Inference with Staged Speculative Decoding

Recent advances with large language models (LLM) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low…

Artificial Intelligence · Computer Science · 2023-08-10 · Benjamin Spector, Chris Re

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods such as EAGLE use a static draft tree, implicitly…

Computation and Language · Computer Science · 2024-07-02 · Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict…

Computation and Language · Computer Science · 2024-04-02 · Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia

TALON: Confidence-Aware Speculative Decoding with Adaptive Token Trees

Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured…

Computation and Language · Computer Science · 2026-01-13 · Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, Xiaoyan Sun

Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement

Speculative decoding is an inference-acceleration method for large language models (LLMs) where a small language model generates a draft-token sequence which is further verified by the target LLM in parallel. Recent works have advanced this…

Machine Learning · Computer Science · 2024-03-06 · Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott

Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which…

Computation and Language · Computer Science · 2025-05-30 · Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

Massive parameters of LLMs have made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures…

Computation and Language · Computer Science · 2025-05-27 · Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The…

Computation and Language · Computer Science · 2025-02-11 · Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra

G2T-LLM: Graph-to-Tree Text Encoding for Molecule Generation with Fine-Tuned Large Language Models

We introduce G2T-LLM, a novel approach for molecule generation that uses graph-to-tree text encoding to transform graph-based molecular structures into a hierarchical text format optimized for large language models (LLMs). This encoding…

Machine Learning · Computer Science · 2024-10-04 · Zhaoning Yu, Xiangyang Xu, Hongyang Gao

SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens

We propose an acceleration scheme for large language models (LLMs) through Speculative Decoding with Semantic Adaptive Tokens (SDSAT). The primary objective of this design is to enhance the LLM model's ability to generate draft tokens more…

Computation and Language · Computer Science · 2024-04-02 · Chengbo Liu, Yong Zhu

Seed-CTS: Unleashing the Power of Tree Search for Superior Performance in Competitive Coding Tasks

Competition-level code generation tasks pose significant challenges for current state-of-the-art large language models (LLMs). For example, on the LiveCodeBench-Hard dataset, models such as O1-Mini and O1-Preview achieve pass@1 rates of…

Artificial Intelligence · Computer Science · 2024-12-31 · Hao Wang, Boyi Liu, Yufeng Zhang, Jie Chen

Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks

Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable…

Computation and Language · Computer Science · 2025-12-15 · Sergey Pankratov, Dan Alistarh

S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models

Deployment of autoregressive large language models (LLMs) is costly, and as these models increase in size, the associated costs will become even more considerable. Consequently, different methods have been proposed to accelerate the token…

Computation and Language · Computer Science · 2024-07-03 · Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh

Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science · 2024-11-28 · Hyun Ryu, Eric Kim

Multi-Candidate Speculative Decoding

Large language models have shown impressive capabilities across a variety of NLP tasks, yet their generating text autoregressively is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments (a…

Computation and Language · Computer Science · 2024-01-15 · Sen Yang, Shujian Huang, Xinyu Dai, Jiajun Chen

ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding

Recent advancements in generative large language models (LLMs) have significantly boosted the performance in natural language processing tasks. However, their efficiency is hampered by the inherent limitations in autoregressive token…

Machine Learning · Computer Science · 2024-02-22 · Shuzhang Zhong, Zebin Yang, Meng Li, Ruihao Gong, Runsheng Wang, Ru Huang