Related papers: Speeding up Speculative Decoding via Sequential Ap…

200 papers

LLMs have low GPU efficiency and high latency due to autoregressive decoding. Speculative decoding (SD) mitigates this using a small draft model to speculatively generate multiple tokens, which are then verified in parallel by a target…

Computation and Language · Computer Science 2026-04-21 Sungkyun Kim, Jaemin Kim, Dogyung Yoon, Jiho Shin, Junyeol Lee, Jiwon Seo

This tutorial presents a comprehensive introduction to Speculative Decoding (SD), an advanced technique for LLM inference acceleration that has garnered significant research interest in recent years. SD is introduced as an innovative…

Computation and Language · Computer Science 2025-03-04 Heming Xia, Cunxiao Du, Yongqi Li, Qian Liu, Wenjie Li

Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the…

Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a…

Computation and Language · Computer Science 2025-10-16 Sanghyun Byun, Mohanad Odema, Jung Ick Guack, Baisub Lee, Jacob Song, Woo Seong Chung

Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch…

Computation and Language · Computer Science 2026-03-19 Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung

Speculative decoding has been shown as an effective way to accelerate Large Language Model (LLM) inference by using a Small Speculative Model (SSM) to generate candidate tokens in a so-called speculation phase, which are subsequently…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-21 Fahao Chen, Peng Li, Tom H. Luan, Zhou Su, Jing Deng

Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model to propose the next few tokens, which are verified by the LLM in a single forward call, reducing latency while preserving its…

Computation and Language · Computer Science 2025-05-30 Milan Gritta, Huiyin Xue, Gerasimos Lampouras

Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in…

Machine Learning · Computer Science 2025-05-27 Yixuan Wang, Yijun Liu, Shiyu Ji, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che

Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens that are later verified by a stronger target model. While effective in centralized systems, its behavior in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Jingwei Song, Wanyi Chen, Xinyuan Song, Max, Chris Tong, Gufeng Chen, Tianyi Zhao, Eric Yang, Bill Shi, Lynn Ai

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups…

Machine Learning · Computer Science 2026-05-15 Linghao Kong, Megan Flynn, Michael Peng, Nir Shavit, Mark Kurtz, Alexandre Marques

The synergistic mechanism based on Speculative Decoding (SD) has garnered considerable attention as a simple yet effective approach for accelerating the inference of large language models (LLMs). Nonetheless, the high rejection rates…

Computation and Language · Computer Science 2025-06-04 Chuanghao Ding, Jiaping Wang, Ziqing Yang, Xiaoliang Wang, Dahua Lin, Cam-Tu Nguyen, Fei Tan

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft…

Machine Learning · Computer Science 2025-10-24 Clara Mohri, Haim Kaplan, Tal Schuster, Yishay Mansour, Amir Globerson

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller model to draft future tokens, which are then verified by the target LLM. This preserves generation quality by accepting only aligned tokens.…

Computation and Language · Computer Science 2026-04-08 Taehyeon Kim, Hojung Jung, Se-Young Yun

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification. While recent methods improve draft quality by tightly coupling the drafter with the target model, the…

Machine Learning · Computer Science 2026-04-14 Jingwei Song, Xinyu Wang, Hanbin Wang, Xiaoxuan Lei, Bill Shi, Shixin Han, Eric Yang, Xiao-Wen Chang, Lynn Ai

Speculative decoding accelerates Large Language Model (LLM) inference by employing a small speculative model (SSM) to generate multiple candidate tokens and verify them using the LLM in parallel. This technique has been widely integrated…

Computation and Language · Computer Science 2025-05-26 Ruixiao Li, Fahao Chen, Peng Li

Speculative decoding (SD) has emerged as a widely used paradigm to accelerate LLM inference without compromising quality. It works by first employing a compact model to draft multiple tokens efficiently and then using the target LLM to…

Computation and Language · Computer Science 2025-03-07 Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li

Deployment of autoregressive large language models (LLMs) is costly, and as these models increase in size, the associated costs will become even more considerable. Consequently, different methods have been proposed to accelerate the token…

Computation and Language · Computer Science 2024-07-03 Parsa Kavehzadeh, Mohammadreza Pourreza, Mojtaba Valipour, Tinashu Zhu, Haoli Bai, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh

Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs) without sacrificing accuracy. Typically, SD employs a small draft model to generate a fixed number of draft tokens,…

Computation and Language · Computer Science 2026-02-03 Situo Zhang, Yifan Zhang, Zichen Zhu, Hankun Wang, Da Ma, Danyang Zhang, Lu Chen, Kai Yu