Related papers: REST: Retrieval-Based Speculative Decoding

DReSD: Dense Retrieval for Speculative Decoding

Speculative decoding (SD) accelerates Large Language Model (LLM) generation by using an efficient draft model to propose the next few tokens, which are verified by the LLM in a single forward call, reducing latency while preserving its…

Computation and Language · Computer Science 2025-05-30 Milan Gritta, Huiyin Xue, Gerasimos Lampouras

CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding

We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively "compacted". REST is a drafting technique for speculative decoding based on retrieving exact n-gram matches of the most…

Computation and Language · Computer Science 2024-08-12 Sophia Ho, Jinsol Park, Patrick Wang

RASD: Retrieval-Augmented Speculative Decoding

Speculative decoding accelerates inference in large language models (LLMs) by generating draft tokens for target model verification. Current approaches for obtaining draft tokens rely on lightweight draft models or additional model…

Computation and Language · Computer Science 2025-03-06 Guofeng Quan, Wenfeng Feng, Chuzhan Hao, Guochao Jiang, Yuewei Zhang, Hao Wang

When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding

Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting…

Computation and Language · Computer Science 2025-11-04 Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang

Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement

Speculative decoding is an inference-acceleration method for large language models (LLMs) where a small language model generates a draft-token sequence which is further verified by the target LLM in parallel. Recent works have advanced this…

Machine Learning · Computer Science 2024-03-06 Wonseok Jeon, Mukul Gagrani, Raghavv Goel, Junyoung Park, Mingu Lee, Christopher Lott

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto

SRT: Accelerating Reinforcement Learning via Speculative Rollout with Tree-Structured Cache

We present Speculative Rollout with Tree-Structured Cache (SRT), a simple, model-free approach to accelerate on-policy reinforcement learning (RL) for language models without sacrificing distributional correctness. SRT exploits the…

Machine Learning · Computer Science 2026-01-15 Chi-Chih Chang, Siqi Zhu, Zhichen Zeng, Haibin Lin, Jiaxuan You, Mohamed S. Abdelfattah, Ziheng Jiang, Xuehai Qian

RASST: Fast Cross-modal Retrieval-Augmented Simultaneous Speech Translation

Simultaneous speech translation (SST) produces target text incrementally from partial speech input. Recent speech large language models (Speech LLMs) have substantially improved SST quality, yet they still struggle to correctly translate…

Computation and Language · Computer Science 2026-02-02 Jiaxuan Luo, Siqi Ouyang, Lei Li

REST: REtrieve & Self-Train for generative action recognition

This work is on training a generative action/video recognition model whose output is a free-form action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages like producing…

Computer Vision and Pattern Recognition · Computer Science 2022-09-30 Adrian Bulat, Enrique Sanchez, Brais Martinez, Georgios Tzimiropoulos

RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

Autoregressive decoding in Large Language Models (LLMs) generates one token per step, causing high inference latency. Speculative decoding (SD) mitigates this through a guess-and-verify strategy, but existing training-free variants face…

Computation and Language · Computer Science 2026-04-17 Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, Hai Zhao

Text Simplification by Tagging

Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are…

Computation and Language · Computer Science 2022-05-11 Kostiantyn Omelianchuk, Vipul Raheja, Oleksandr Skurzhanskyi

Make Every Draft Count: Hidden State based Speculative Decoding

Speculative decoding has emerged as a pivotal technique to accelerate LLM inference by employing a lightweight draft model to generate candidate tokens that are subsequently verified by the target model in parallel. However, while this…

Computation and Language · Computer Science 2026-02-26 Yuetao Chen, Xuliang Wang, Xinzhou Zheng, Ming Li, Peng Wang, Hong Xu

Learning to Better Search with Language Models via Guided Reinforced Self-Training

While language models have shown remarkable performance across diverse tasks, they still encounter challenges in complex reasoning scenarios. Recent research suggests that language models trained on linearized search traces toward…

Artificial Intelligence · Computer Science 2025-10-28 Seungyong Moon, Bumsoo Park, Hyun Oh Song

Reward-Guided Speculative Decoding for Efficient LLM Reasoning

We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target…

Computation and Language · Computer Science 2025-06-27 Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong

Diversify Question Generation with Retrieval-Augmented Style Transfer

Given a textual passage and an answer, humans are able to ask questions with various expressions, but this ability is still challenging for most question generation (QG) systems. Existing solutions mainly focus on the internal knowledge…

Computation and Language · Computer Science 2023-10-24 Qi Gou, Zehua Xia, Bowen Yu, Haiyang Yu, Fei Huang, Yongbin Li, Nguyen Cam-Tu

SpecTr: Fast Speculative Decoding via Optimal Transport

Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time making it slow, and even prohibitive in certain tasks.…

Machine Learning · Computer Science 2024-01-19 Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu

Nearest Neighbor Speculative Decoding for LLM Generation and Attribution

Large language models (LLMs) often hallucinate and lack the ability to provide attribution for their generations. Semi-parametric LMs, such as kNN-LM, approach these limitations by refining the output of an LM for a given prompt using its…

Computation and Language · Computer Science 2025-04-28 Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin

Accelerated Test-Time Scaling with Model-Free Speculative Sampling

Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating…

Computation and Language · Computer Science 2026-05-22 Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

RecycleGPT: An Autoregressive Language Model with Recyclable Module

Existing large language models have to run K times to generate a sequence of K tokens. In this paper, we present RecycleGPT, a generative language model with fast decoding speed by recycling pre-generated model states without running the…

Computation and Language · Computer Science 2024-05-24 Yufan Jiang, Qiaozhi He, Xiaomin Zhuang, Zhihua Wu, Kunpeng Wang, Wenlai Zhao, Guangwen Yang

Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation

We propose Speculative Decoding (SpecDec), for the first time ever, to formally study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding. Speculative Decoding has two innovations: Spec-Drafter -- an…

Computation and Language · Computer Science 2023-10-31 Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, Zhifang Sui