Related papers: Efficient Adaptive Rejection Sampling for Accelera…

Adaptive Rejection Sampling with fixed number of nodes

The adaptive rejection sampling (ARS) algorithm is a universal random generator for drawing samples efficiently from a univariate log-concave target probability density function (pdf). ARS generates independent samples from the target via…

Computation · Statistics 2017-10-10 L. Martino, F. Louzada

Constrained Adaptive Rejection Sampling

Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints. Existing approaches to constrained generation fall along a spectrum: greedy constrained decoding…

Artificial Intelligence · Computer Science 2025-10-03 Paweł Parys, Sairam Vaidya, Taylor Berg-Kirkpatrick, Loris D'Antoni

SpecASR: Accelerating LLM-based Automatic Speech Recognition via Speculative Decoding

Large language model (LLM)-based automatic speech recognition (ASR) has recently attracted a lot of attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs challenges the…

Audio and Speech Processing · Electrical Eng. & Systems 2025-07-29 Linye Wei, Shuzhang Zhong, Songqiang Xu, Runsheng Wang, Ru Huang, Meng Li

ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models

Large Reasoning Language Models (LRLMs or LRMs) demonstrate remarkable capabilities in complex reasoning tasks, but suffer from significant computational inefficiencies due to overthinking phenomena. Existing efficient reasoning methods…

Artificial Intelligence · Computer Science 2025-10-13 Dongqi Zheng

MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification

Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification. While recent methods improve draft quality by tightly coupling the drafter with the target model, the…

Machine Learning · Computer Science 2026-04-14 Jingwei Song, Xinyu Wang, Hanbin Wang, Xiaoxuan Lei, Bill Shi, Shixin Han, Eric Yang, Xiao-Wen Chang, Lynn Ai

AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft…

Computation and Language · Computer Science 2026-05-27 Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu

Continuous Speculative Decoding for Autoregressive Image Generation

Continuous visual autoregressive (AR) models have demonstrated promising performance in image generation. However, the heavy autoregressive inference burden imposes significant overhead. In Large Language Models (LLMs), speculative decoding…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang

Regenerative Rejection Sampling

This thesis presents Regenerative Rejection Sampling (RRS), a novel approximate sampling algorithm inspired by classical Rejection Sampling and Markov Chain Monte Carlo methods. The method constructs a continuous-time regenerative process…

Computation · Statistics 2026-04-01 Tommaso Bozzi

Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification

Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the…

Computation and Language · Computer Science 2025-09-15 Jikai Wang, Zhenxu Tian, Juntao Li, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang

Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the…

Machine Learning · Computer Science 2026-04-08 Yongchang Hao, Lili Mou

Ada-RS: Adaptive Rejection Sampling for Selective Thinking

Large language models (LLMs) are increasingly being deployed in cost and latency-sensitive settings. While chain-of-thought improves reasoning, it can waste tokens on simple requests. We study selective thinking for tool-using LLMs and…

Artificial Intelligence · Computer Science 2026-02-24 Yirou Ge, Yixi Li, Alec Chiu, Shivani Shekhar, Zijie Pan, Avinash Thangali, Yun-Shiuan Chuang, Chaitanya Kulkarni, Uma Kona, Linsey Pang, Prakhar Mehrotra

Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment

The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive…

Machine Learning · Computer Science 2025-02-03 Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, Jonas Kohler

LSRS: Latent Scale Rejection Sampling for Visual Autoregressive Modeling

The Visual Autoregressive (VAR) modeling approach for image generation proposes autoregressive processing across hierarchical scales, decoding multiple tokens per scale in parallel. This method achieves high-quality generation while…

Computer Vision and Pattern Recognition · Computer Science 2025-12-04 Hong-Kai Zheng, Piji Li

DEER: Draft with Diffusion, Verify with Autoregressive Models

Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify…

Machine Learning · Computer Science 2025-12-18 Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu

Adaptive Draft-Verification for Efficient Large Language Model Decoding

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires…

Computation and Language · Computer Science 2024-08-20 Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu

When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding

Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting…

Computation and Language · Computer Science 2025-11-04 Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang

Test-Time Speculation

Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the acceptance length, or number of draft tokens accepted…

Computation and Language · Computer Science 2026-05-20 Avinash Kumar, Sujay Sanghavi, Poulami Das

EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models

Speculative decoding emerges as a pivotal technique for enhancing the inference speed of Large Language Models (LLMs). Despite recent research aiming to improve prediction efficiency, multi-sample speculative decoding has been overlooked…

Computation and Language · Computer Science 2024-10-15 Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang

Training-free Dropout Sampling for Semantic Token Acceptance in Speculative Decoding

Speculative decoding accelerates large language model inference by proposing tokens with a lightweight draft model and selectively accepting them using a target model. This work introduces DropMatch, a novel approach that matches draft…

Computation and Language · Computer Science 2026-03-05 Jeongtae Lee, Minjung Jo, Hyunjoon Jeong, Gunho Park, Sunghyeon Woo, Joonghoon Kim, Se Jung Kwon, Dongsoo Lee

Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation

Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine-tuned for specific domains. A naive solution is to retrain draft models for every target model, which is costly and…

Machine Learning · Computer Science 2026-03-11 Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Yuhao Chen, Qingyu Zhang, Jixiang Luo, Xuelong Li, Rongrong Ji