Related papers: P-EAGLE: Parallel-Drafting EAGLE with Scalable Tra…

PEARL: Parallel Speculative Decoding with Adaptive Draft Length

Speculative decoding (SD), where an extra draft model is employed to provide multiple draft tokens first, and then the original target model verifies these tokens in parallel, has shown great power for LLM inference acceleration. However,…

Computation and Language · Computer Science 2025-02-18 Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, Xiao Sun

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

FastEagle: Cascaded Drafting for Accelerating Speculative Decoding

Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive…

Machine Learning · Computer Science 2025-09-26 Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren

PARD: Accelerating LLM Inference with Low-Cost PARallel Draft Model Adaptation

The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a…

Machine Learning · Computer Science 2025-12-02 Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, Emad Barsoum

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features…

Computation and Language · Computer Science 2025-04-24 Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods such as EAGLE use a static draft tree, implicitly…

Computation and Language · Computer Science 2024-07-02 Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

Cross-Attention Speculative Decoding

Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on…

Computation and Language · Computer Science 2026-02-12 Wei Zhong, Manasa Bharadwaj, Yixiao Wang, Yipeng Ji, Chul Lee

ParallelVLM: Lossless Video-LLM Acceleration with Visual Alignment Aware Parallel Speculative Decoding

Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, Cong Wang

Attention Drift: What Autoregressive Speculative Decoding Models Learn

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call…

Machine Learning · Computer Science 2026-05-12 Doğaç Eldenk, Payal Mohapatra, Yigitcan Comlek, Kaan Oktay, Hongyang Zhang, Stephen Xia

Efficient Document Parsing via Parallel Token Prediction

Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing…

Computation and Language · Computer Science 2026-03-17 Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li

DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference

Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference,…

Computation and Language · Computer Science 2026-01-28 Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3…

Computation and Language · Computer Science 2026-05-22 Weijie Shi, Qiang Xu, Fan Deng, Yaguang Wu, Jiarun Liu, Yehong Xu, Hao Chen, Jia Zhu, Jiajie Xu, Xiangjun Huang, Jian Yang, Xiaofang Zhou

OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs

Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads…

Computation and Language · Computer Science 2025-10-10 Jaeseong Lee, seung-won hwang, Aurick Qiao, Gabriele Oliaro, Ye Wang, Samyam Rajbhandari

DEER: Draft with Diffusion, Verify with Autoregressive Models

Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify…

Machine Learning · Computer Science 2025-12-18 Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. Firstly, autoregression at the feature (second-to-top-layer)…

Machine Learning · Computer Science 2025-03-05 Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement…

Machine Learning · Computer Science 2026-03-23 Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han

SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding

Large language models incur high inference latency due to sequential autoregressive decoding. Speculative decoding alleviates this bottleneck by using a lightweight draft model to propose multiple tokens for batched verification. However,…

Machine Learning · Computer Science 2026-03-20 Shenggui Li, Chao Wang, Yikai Zhu, Yubo Wang, Fan Yin, Shuai Shi, Yefei Chen, Xiaomin Dong, Qiaoling Chen, Jin Pan, Ji Li, Laixin Xie, Yineng Zhang, Lei Yu, Yonggang Wen, Ivor Tsang, Tianwei Zhang

D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward…

Machine Learning · Computer Science 2026-05-20 Tianyu Wu, Yu Yao, Zhenting Qi, Han Zheng, Zhuohan Wang, Haoran Ma, Lawrence Liao, Himabindu Lakkaraju, Ju Li, Yilun Du

SpecVLM: Fast Speculative Decoding in Vision-Language Models

Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens…

Computer Vision and Pattern Recognition · Computer Science 2025-09-23 Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum