Related papers: Dynamic Delayed Tree Expansion For Improved Multi-…

200 papers

Autoregressive language models demonstrate excellent performance in various scenarios. However, their inference efficiency is limited by the one-step-one-word generation mode, which has become a pressing problem recently as the models become…

Computation and Language · Computer Science 2025-04-25 Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Min Zhang

Speculative decoding is a promising approach for accelerating large language models. The primary idea is to use a lightweight draft model to speculate the output of the target model for multiple subsequent timesteps, and then verify them in…

Computation and Language · Computer Science 2025-11-06 Yepeng Weng, Qiao Hu, Xujie Chen, Li Liu, Dianwen Mei, Huishi Qiu, Jiang Tian, Zhongchao Shi

Speculative decoding accelerates Large Language Models via draft-then-verify, where verification can be framed as an Optimal Transport (OT) problem. Existing approaches typically handle multi-draft and multi-step aspects in isolation,…

Computation and Language · Computer Science 2026-05-07 Yepeng Weng, Qiao Hu, Takehisa Yairi

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, like EAGLE3 or DFlash, is supervised…

Computation and Language · Computer Science 2026-05-29 Haodi Lei, Yafy Li, Haoran Zhang, Shunkai Zhang, Qianjia Cheng, Xiaoye Qu, Ganqu Cui, Bowen Zhou, Ning Ding, Yun Luo, Yu Cheng

Speculative sampling reduces the latency of autoregressive decoding for target model LLMs without sacrificing inference quality, by using a cheap draft model to suggest a candidate token and a verification criterion to accept or resample…

Machine Learning · Computer Science 2025-11-21 Rahul Krishna Thomas, Arka Pal

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an…

Computation and Language · Computer Science 2026-04-15 Liran Ringel, Yaniv Romano

Speculative decoding accelerates inference in large language models (LLMs) by generating draft tokens for target model verification. Current approaches for obtaining draft tokens rely on lightweight draft models or additional model…

Computation and Language · Computer Science 2025-03-06 Guofeng Quan, Wenfeng Feng, Chuzhan Hao, Guochao Jiang, Yuewei Zhang, Hao Wang

Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, making it slow, and even prohibitive in certain tasks.…

Machine Learning · Computer Science 2024-01-19 Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu

Speculative decoding accelerates language model inference by separating generation into fast drafting and parallel verification. Its main limitation is drafter-verifier misalignment, which limits token acceptance and reduces overall…

Machine Learning · Computer Science 2025-11-14 Frédéric Berdoz, Peer Rheinboldt, Roger Wattenhofer

Speculative decoding accelerates large language model (LLM) inference by letting a lightweight draft model propose multiple tokens that the target model verifies in parallel. Yet existing training objectives optimize only a single greedy…

Computation and Language · Computer Science 2026-03-03 Shijing Hu, Jingyang Li, Zhihui Lu, Pan Zhou

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM…

Machine Learning · Computer Science 2026-05-20 Yuhao Shen, Tianyu Liu, Xinyi Hu, Quan Kong, Baolin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan, Cong Wang

Tree-based speculative decoding accelerates autoregressive generation by verifying multiple draft candidates in parallel, but this advantage weakens for sparse Mixture-of-Experts (MoE) models. As the draft tree grows, different branches…

Computation and Language · Computer Science 2026-05-04 Lehan Pan, Ziyang Tao, Ruoyu Pang, Xiao Wang, Jianjun Zhao, Yanyong Zhang

While speculative decoding has recently appeared as a promising direction for accelerating the inference of large language models (LLMs), the speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods…

Machine Learning · Computer Science 2024-10-16 Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou

Speculative decoding (SD) has become a standard technique for accelerating LLM inference without sacrificing output quality. Recent advances in speculative decoding have shifted from sequential chain-based drafting to tree-structured…

Computation and Language · Computer Science 2026-01-13 Tianyu Liu, Qitan Lv, Yuhao Shen, Xiao Sun, Xiaoyan Sun

Large Language Models (LLMs) have become an indispensable part of natural language processing tasks. However, autoregressive sampling has become an efficiency bottleneck. Multi-Draft Speculative Decoding (MDSD) is a recent approach where,…

Computation and Language · Computer Science 2025-02-27 Zhengmian Hu, Tong Zheng, Vignesh Viswanathan, Ziyi Chen, Ryan A. Rossi, Yihan Wu, Dinesh Manocha, Heng Huang

Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a…

Computation and Language · Computer Science 2025-10-16 Sanghyun Byun, Mohanad Odema, Jung Ick Guack, Baisub Lee, Jacob Song, Woo Seong Chung

Speculative decoding has emerged as a widely adopted paradigm for accelerating large language model inference, where a lightweight draft model rapidly generates candidate tokens that are then verified in parallel by a larger target model.…

Machine Learning · Computer Science 2026-03-16 Yu-Yang Qian, Hao-Cong Wu, Yichao Fu, Hao Zhang, Peng Zhao

Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad…

Computation and Language · Computer Science 2026-03-31 Mohamad Zbib, Mohamad Bazzi, Ammar Mohanna, Hasan Abed Al Kader Hammoud, Bernard Ghanem

Speculative decoding is an inference-acceleration method for large language models (LLMs) where a small language model generates a draft-token sequence which is further verified by the target LLM in parallel. Recent works have advanced this…

Machine Learning · Computer Science 2024-03-06 Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid…

Computation and Language · Computer Science 2026-05-29 Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela