English
Related papers

Related papers: Dynamic-Width Speculative Beam Decoding for Effici…

200 papers

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science 2024-11-28 Hyun Ryu , Eric Kim

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia , Zhe Yang , Qingxiu Dong , Peiyi Wang , Yongqi Li , Tao Ge , Tianyu Liu , Wenjie Li , Zhifang Sui

This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy.…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Luxi Lin , Zhihang Lin , Zhanpeng Zeng , Rongrong Ji

Deployment of autoregressive large language models (LLMs) is costly, and as these models increase in size, the associated costs will become even more considerable. Consequently, different methods have been proposed to accelerate the token…

Computation and Language · Computer Science 2024-07-03 Parsa Kavehzadeh , Mohammadreza Pourreza , Mojtaba Valipour , Tinashu Zhu , Haoli Bai , Ali Ghodsi , Boxing Chen , Mehdi Rezagholizadeh

This tutorial presents a comprehensive introduction to Speculative Decoding (SD), an advanced technique for LLM inference acceleration that has garnered significant research interest in recent years. SD is introduced as an innovative…

Computation and Language · Computer Science 2025-03-04 Heming Xia , Cunxiao Du , Yongqi Li , Qian Liu , Wenjie Li

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

Machine Learning · Computer Science 2025-02-06 Minghao Yan , Saurabh Agarwal , Shivaram Venkataraman

Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing…

Computation and Language · Computer Science 2025-11-26 Luohe Shi , Zuchao Li , Lefei Zhang , Baoyuan Qi , Guoming Liu , Hai Zhao

Large language models achieve strong machine translation quality but incur high inference cost and latency, posing challenges for simultaneous translation. Re-translation provides a practical solution for off-the-shelf LLMs by repeatedly…

Computation and Language · Computer Science 2026-01-06 Linxiao Zeng , Haoyun Deng , Kangyuan Shu , Shizhen Wang

Diffusion-based Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive models, offering unique advantages through bidirectional attention and parallel generation paradigms. However, the generation results…

Computation and Language · Computer Science 2025-10-07 Yifeng Gao , Ziang Ji , Yuxuan Wang , Biqing Qi , Hanlin Xu , Linfeng Zhang

Speculative decoding is an inference-acceleration method for large language models (LLMs) where a small language model generates a draft-token sequence which is further verified by the target LLM in parallel. Recent works have advanced this…

Machine Learning · Computer Science 2024-03-06 Wonseok Jeon , Mukul Gagrani , Raghavv Goel , Junyoung Park , Mingu Lee , Christopher Lott

Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain…

Machine Learning · Computer Science 2025-12-02 Fengze Yu , Leshu Li , Brad McDanel , Sai Qian Zhang

Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a…

Computation and Language · Computer Science 2025-10-16 Sanghyun Byun , Mohanad Odema , Jung Ick Guack , Baisub Lee , Jacob Song , Woo Seong Chung

Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario with multi-user inference, where draft…

Information Theory · Computer Science 2026-04-24 Yaodan Xu , Sheng Zhou , Zhisheng Niu

Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in…

Computation and Language · Computer Science 2024-11-12 Euiin Yi , Taehyeon Kim , Hongseok Jeung , Du-Seong Chang , Se-Young Yun

Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising…

Computation and Language · Computer Science 2025-03-04 Kai Lv , Honglin Guo , Qipeng Guo , Xipeng Qiu

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller model to draft future tokens, which are then verified by the target LLM. This preserves generation quality by accepting only aligned tokens.…

Computation and Language · Computer Science 2026-04-08 Taehyeon Kim , Hojung Jung , Se-Young Yun

Recently, speculative decoding (SD) has emerged as a promising technique to accelerate LLM inference by employing a small draft model to propose draft tokens in advance, and validating them in parallel with the large target model. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-15 Yuhao Shen , Junyi Shen , Quan Kong , Tianyu Liu , Yao Lu , Cong Wang

Large language models (LLMs) have transformed natural language processing but face critical deployment challenges in device-edge systems due to resource limitations and communication overhead. To address these issues, collaborative…

Signal Processing · Electrical Eng. & Systems 2025-07-18 Jiahong Ning , Ce Zheng , Tingting Yang

Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models (LLMs) by employing a small language model to draft a hypothesis sequence, which is then validated by the LLM. The effectiveness…

Computation and Language · Computer Science 2024-07-24 Zhuocheng Gong , Jiahao Liu , Ziyue Wang , Pengfei Wu , Jingang Wang , Xunliang Cai , Dongyan Zhao , Rui Yan

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass.…

Computation and Language · Computer Science 2025-06-12 Nadav Timor , Jonathan Mamou , Daniel Korat , Moshe Berchansky , Gaurav Jain , Oren Pereg , Moshe Wasserblat , David Harel
‹ Prev 1 2 3 10 Next ›