Related papers: Efficient Adaptive Rejection Sampling for Accelera…

200 papers

The adaptive rejection sampling (ARS) algorithm is a universal random generator for drawing samples efficiently from a univariate log-concave target probability density function (pdf). ARS generates independent samples from the target via…

Computation · Statistics 2017-10-10 L. Martino, F. Louzada

Language Models (LMs) are increasingly used in applications where generated outputs must satisfy strict semantic or syntactic constraints. Existing approaches to constrained generation fall along a spectrum: greedy constrained decoding…

Artificial Intelligence · Computer Science 2025-10-03 Paweł Parys, Sairam Vaidya, Taylor Berg-Kirkpatrick, Loris D'Antoni

Large language model (LLM)-based automatic speech recognition (ASR) has recently attracted a lot of attention due to its high recognition accuracy and enhanced multi-dialect support. However, the high decoding latency of LLMs challenges the…

Audio and Speech Processing · Electrical Eng. & Systems 2025-07-29 Linye Wei, Shuzhang Zhong, Songqiang Xu, Runsheng Wang, Ru Huang, Meng Li

Large Reasoning Language Models (LRLMs or LRMs) demonstrate remarkable capabilities in complex reasoning tasks, but suffer from significant computational inefficiencies due to overthinking phenomena. Existing efficient reasoning methods…

Artificial Intelligence · Computer Science 2025-10-13 Dongqi Zheng

Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification. While recent methods improve draft quality by tightly coupling the drafter with the target model, the…

Machine Learning · Computer Science 2026-04-14 Jingwei Song, Xinyu Wang, Hanbin Wang, Xiaoxuan Lei, Bill Shi, Shixin Han, Eric Yang, Xiao-Wen Chang, Lynn Ai

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft…

Computation and Language · Computer Science 2026-05-27 Kuan-Wei Lu, Ding-Yong Hong, Pangfeng Liu, Jan-Jan Wu

Continuous visual autoregressive (AR) models have demonstrated promising performance in image generation. However, the heavy autoregressive inference burden imposes significant overhead. In Large Language Models (LLMs), speculative decoding…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang

This thesis presents Regenerative Rejection Sampling (RRS), a novel approximate sampling algorithm inspired by classical Rejection Sampling and Markov Chain Monte Carlo methods. The method constructs a continuous-time regenerative process…

Computation · Statistics 2026-04-01 Tommaso Bozzi

Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the…

Computation and Language · Computer Science 2025-09-15 Jikai Wang, Zhenxu Tian, Juntao Li, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang

Speculative sampling (SpS) has been successful in accelerating the decoding throughput of auto-regressive large language models by leveraging smaller draft models. SpS strictly enforces the generated distribution to match that of the…

Machine Learning · Computer Science 2026-04-08 Yongchang Hao, Lili Mou

Large language models (LLMs) are increasingly being deployed in cost and latency-sensitive settings. While chain-of-thought improves reasoning, it can waste tokens on simple requests. We study selective thinking for tool-using LLMs and…

The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive…

The Visual Autoregressive (VAR) modeling approach for image generation proposes autoregressive processing across hierarchical scales, decoding multiple tokens per scale in parallel. This method achieves high-quality generation while…

Computer Vision and Pattern Recognition · Computer Science 2025-12-04 Hong-Kai Zheng, Piji Li

Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify…

Machine Learning · Computer Science 2025-12-18 Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires…

Computation and Language · Computer Science 2024-08-20 Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu

Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting…

Computation and Language · Computer Science 2025-11-04 Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang

Speculative decoding accelerates LLM inference by using a fast draft model to generate tokens and a more accurate target model to verify them. Its performance depends on the $\textit{acceptance length}$, or number of draft tokens accepted…

Computation and Language · Computer Science 2026-05-20 Avinash Kumar, Sujay Sanghavi, Poulami Das

Speculative decoding emerges as a pivotal technique for enhancing the inference speed of Large Language Models (LLMs). Despite recent research aiming to improve prediction efficiency, multi-sample speculative decoding has been overlooked…

Computation and Language · Computer Science 2024-10-15 Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang

Speculative decoding accelerates large language model inference by proposing tokens with a lightweight draft model and selectively accepting them using a target model. This work introduces DropMatch, a novel approach that matches draft…

Computation and Language · Computer Science 2026-03-05 Jeongtae Lee, Minjung Jo, Hyunjoon Jeong, Gunho Park, Sunghyeon Woo, Joonghoon Kim, Se Jung Kwon, Dongsoo Lee

Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine-tuned for specific domains. A naive solution is to retrain draft models for every target model, which is costly and…

Machine Learning · Computer Science 2026-03-11 Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Yuhao Chen, Qingyu Zhang, Jixiang Luo, Xuelong Li, Rongrong Ji