Related papers: P-EAGLE: Parallel-Drafting EAGLE with Scalable Tra…

200 papers

Speculative decoding (SD), where an extra draft model is employed to provide multiple draft tokens first, and then the original target model verifies these tokens in parallel, has shown great power for LLM inference acceleration. However,…

Computation and Language · Computer Science · 2025-02-18 · Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, Xiao Sun

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science · 2024-10-10 · Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive…

Machine Learning · Computer Science · 2025-09-26 · Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren

The autoregressive nature of large language models (LLMs) fundamentally limits inference speed, as each forward pass generates only a single token and is often bottlenecked by memory bandwidth. Speculative decoding has emerged as a…

Machine Learning · Computer Science · 2025-12-02 · Zihao An, Huajun Bai, Ziqiong Liu, Dong Li, Emad Barsoum

The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features…

Computation and Language · Computer Science · 2025-04-24 · Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods such as EAGLE use a static draft tree, implicitly…

Computation and Language · Computer Science · 2024-07-02 · Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on…

Computation and Language · Computer Science · 2026-02-12 · Wei Zhong, Manasa Bharadwaj, Yixiao Wang, Yipeng Ji, Chul Lee

Although current Video-LLMs achieve impressive performance in video understanding tasks, their autoregressive decoding efficiency remains constrained by the massive number of video tokens. Visual token pruning can partially ease this…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-24 · Quan Kong, Yuhao Shen, Yicheng Ji, Huan Li, Cong Wang

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call…

Machine Learning · Computer Science · 2026-05-12 · Doğaç Eldenk, Payal Mohapatra, Yigitcan Comlek, Kaan Oktay, Hongyang Zhang, Stephen Xia

Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing…

Computation and Language · Computer Science · 2026-03-17 · Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li

Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference,…

Computation and Language · Computer Science · 2026-01-28 · Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3…

Computation and Language · Computer Science · 2026-05-22 · Weijie Shi, Qiang Xu, Fan Deng, Yaguang Wu, Jiarun Liu, Yehong Xu, Hao Chen, Jia Zhu, Jiajie Xu, Xiangjun Huang, Jian Yang, Xiaofang Zhou

Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads…

Computation and Language · Computer Science · 2025-10-10 · Jaeseong Lee, seung-won hwang, Aurick Qiao, Gabriele Oliaro, Ye Wang, Samyam Rajbhandari

Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify…

Machine Learning · Computer Science · 2025-12-18 · Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu

Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. Firstly, autoregression at the feature (second-to-top-layer)…

Machine Learning · Computer Science · 2025-03-05 · Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science · 2025-10-06 · Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

The emergence of Large Language Models (LLMs) with strong reasoning capabilities marks a significant milestone, unlocking new frontiers in complex problem-solving. However, training these reasoning models, typically using Reinforcement…

Machine Learning · Computer Science · 2026-03-23 · Qinghao Hu, Shang Yang, Junxian Guo, Xiaozhe Yao, Yujun Lin, Yuxian Gu, Han Cai, Chuang Gan, Ana Klimovic, Song Han

Large language models incur high inference latency due to sequential autoregressive decoding. Speculative decoding alleviates this bottleneck by using a lightweight draft model to propose multiple tokens for batched verification. However,…

Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward…

Machine Learning · Computer Science · 2026-05-20 · Tianyu Wu, Yu Yao, Zhenting Qi, Han Zheng, Zhuohan Wang, Haoran Ma, Lawrence Liao, Himabindu Lakkaraju, Ju Li, Yilun Du

Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs), but directly porting it to vision-language models (VLMs) faces unique systems constraints: the prefill stage is dominated by visual tokens…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-23 · Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum