Related papers: PaSS: Parallel Speculative Sampling

200 papers

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

Autoregressive decoding in language models is inherently slow, generating only one token per forward pass. We propose Parallel Token Prediction (PTP), a general-purpose framework for predicting multiple tokens in a single model call. PTP…

Computation and Language · Computer Science 2026-03-06 Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh, Stephan Mandt

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of…

Computation and Language · Computer Science 2023-02-03 Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper

Large language models have shown impressive capabilities across a variety of NLP tasks, yet their generating text autoregressively is time-consuming. One way to speed them up is speculative decoding, which generates candidate segments (a…

Computation and Language · Computer Science 2024-01-15 Sen Yang, Shujian Huang, Xinyu Dai, Jiajun Chen

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto

We propose a novel speculative decoding method tailored for multi-sample reasoning scenarios, such as self-consistency and Best-of-N sampling. Our method exploits the intrinsic consensus of parallel generation paths to synthesize…

Computation and Language · Computer Science 2025-03-10 Yiwei Li, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Yueqi Zhang, Ji Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

Speculative decoding emerges as a pivotal technique for enhancing the inference speed of Large Language Models (LLMs). Despite recent research aiming to improve prediction efficiency, multi-sample speculative decoding has been overlooked…

Computation and Language · Computer Science 2024-10-15 Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang

Speculative decoding (SD), where an extra draft model is employed to provide multiple draft tokens first, and then the original target model verifies these tokens in parallel, has shown great power for LLM inference acceleration. However,…

Computation and Language · Computer Science 2025-02-18 Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, Xiao Sun

Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any…

Machine Learning · Computer Science 2023-05-22 Yaniv Leviathan, Matan Kalman, Yossi Matias

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science 2024-04-19 Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

Large language models (LLMs) have achieved impressive results on multi-step mathematical reasoning, yet at the cost of high computational overhead. This challenge is particularly acute for test-time scaling methods such as parallel…

Machine Learning · Computer Science 2026-03-24 Yuanlin Chu, Bo Wang, Xiang Liu, Hong Chen, Aiwei Liu, Xuming Hu

Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a…

Computation and Language · Computer Science 2025-10-16 Sanghyun Byun, Mohanad Odema, Jung Ick Guack, Baisub Lee, Jacob Song, Woo Seong Chung

Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications…

It is commonly believed that scaling language models should commit a significant space or time cost, by increasing the parameters (parameter scaling) or output tokens (inference-time scaling). We introduce the third and more…

Machine Learning · Computer Science 2025-05-16 Mouxiang Chen, Binyuan Hui, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Jianling Sun, Junyang Lin, Zhongxin Liu

Recent advances in reasoning models have demonstrated significant improvements in accuracy by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and…

Computation and Language · Computer Science 2025-08-27 Yijiong Yu

Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to…

Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable…

Computation and Language · Computer Science 2025-12-15 Sergey Pankratov, Dan Alistarh

Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time making it slow, and even prohibitive in certain tasks.…

Machine Learning · Computer Science 2024-01-19 Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu

Speculative decoding accelerates large language model inference using a smaller draft model. In this paper, we establish a surprising connection between speculative decoding and channel simulation, which aims at simulating a noisy channel…

Computation and Language · Computer Science 2025-04-23 Szymon Kobus, Deniz Gündüz