Related papers: TPP-SD: Accelerating Transformer Point Process Sam…

200 papers

Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a…

Computation and Language · Computer Science 2025-10-16 Sanghyun Byun, Mohanad Odema, Jung Ick Guack, Baisub Lee, Jacob Song, Woo Seong Chung

Temporal point processes are powerful generative models for event sequences that capture complex dependencies in time-series data. They are commonly specified using autoregressive models that learn the distribution of the next event from…

Machine Learning · Computer Science 2025-10-24 Marin Biloš, Anderson Schneider, Yuriy Nevmyvaka

Speculative Decoding has emerged as a popular technique for accelerating inference in Large Language Models. However, most existing approaches yield only modest improvements in production serving systems. Methods that achieve substantial…

Computation and Language · Computer Science 2026-01-08 Michele Marzollo, Jiawei Zhuang, Niklas Roemer, Niklas Zwingenberger, Lorenz K. Müller, Lukas Cavigelli

We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of…

Computation and Language · Computer Science 2023-02-03 Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper

Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft…

Machine Learning · Computer Science 2025-10-24 Clara Mohri, Haim Kaplan, Tal Schuster, Yishay Mansour, Amir Globerson

This tutorial presents a comprehensive introduction to Speculative Decoding (SD), an advanced technique for LLM inference acceleration that has garnered significant research interest in recent years. SD is introduced as an innovative…

Computation and Language · Computer Science 2025-03-04 Heming Xia, Cunxiao Du, Yongqi Li, Qian Liu, Wenjie Li

Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs) by using a lower complexity draft model to propose candidate tokens verified by a larger target model. To further improve efficiency,…

Computation and Language · Computer Science 2024-12-17 Xiaofan Lu, Yixiao Zeng, Feiyang Ma, Zixu Yu, Marco Levorato

Modern autoregressive speech synthesis models leveraging language models have demonstrated remarkable performance. However, the sequential nature of next token prediction in these models leads to significant latency, hindering their…

Sound · Computer Science 2025-06-04 Zijian Lin, Yang Zhang, Yougen Yuan, Yuming Yan, Jinjiang Liu, Zhiyong Wu, Pengfei Hu, Qun Yu

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass.…

Computation and Language · Computer Science 2025-06-12 Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel

Large language models achieve strong machine translation quality but incur high inference cost and latency, posing challenges for simultaneous translation. Re-translation provides a practical solution for off-the-shelf LLMs by repeatedly…

Computation and Language · Computer Science 2026-01-06 Linxiao Zeng, Haoyun Deng, Kangyuan Shu, Shizhen Wang

Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental…

Computation and Language · Computer Science 2026-05-05 Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou

Speculative decoding is an inference-acceleration method for large language models (LLMs) where a small language model generates a draft-token sequence which is further verified by the target LLM in parallel. Recent works have advanced this…

Machine Learning · Computer Science 2024-03-06 Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott

Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative decoding and channel simulation, which aims at simulating a noisy channel…

Computation and Language · Computer Science 2025-04-23 Szymon Kobus, Deniz Gündüz

Speculative sampling is a popular technique for accelerating inference in Large Language Models by generating candidate tokens using a fast draft model and accepting or rejecting them based on the target model's distribution. While…

Machine Learning · Computer Science 2025-07-08 Valentin De Bortoli, Alexandre Galashov, Arthur Gretton, Arnaud Doucet

Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments (a…

Computation and Language · Computer Science 2024-01-15 Sen Yang, Shujian Huang, Xinyu Dai, Jiajun Chen

Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario with multi-user inference, where draft…

Information Theory · Computer Science 2026-04-24 Yaodan Xu, Sheng Zhou, Zhisheng Niu

Diffusion Policy (DP) excels in embodied control but suffers from high inference latency and computational cost due to multiple iterative denoising steps. The temporal complexity of embodied tasks demands a dynamic and adaptable computation…

Machine Learning · Computer Science 2025-12-19 Ye Li, Jiahe Feng, Yuan Meng, Kangye Ji, Chen Tang, Xinwan Wen, Shutao Xia, Zhi Wang, Wenwu Zhu

Recently, speculative decoding (SD) has emerged as a promising technique to accelerate LLM inference by employing a small draft model to propose draft tokens in advance, and validating them in parallel with the large target model. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-15 Yuhao Shen, Junyi Shen, Quan Kong, Tianyu Liu, Yao Lu, Cong Wang

Speculative decoding (SD) has emerged as a promising approach to accelerate LLM inference without sacrificing output quality. Existing SD methods tailored for video-LLMs primarily focus on pruning redundant visual tokens to mitigate the…

Computer Vision and Pattern Recognition · Computer Science 2026-01-14 Qitan Lv, Tianyu Liu, Wen Wu, Xuenan Xu, Bowen Zhou, Feng Wu, Chao Zhang