Related papers: Multi-Scale Local Speculative Decoding for Image G…

Continuous Speculative Decoding for Autoregressive Image Generation

Continuous visual autoregressive (AR) models have demonstrated promising performance in image generation. However, the heavy autoregressive inference burden imposes significant overhead. In Large Language Models (LLMs), speculative decoding…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-30 · Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang

Speculative Decoding Reimagined for Multimodal Large Language Models

This paper introduces Multimodal Speculative Decoding (MSD) to accelerate Multimodal Large Language Models (MLLMs) inference. Speculative decoding has been shown to accelerate Large Language Models (LLMs) without sacrificing accuracy.…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-21 · Luxi Lin, Zhihang Lin, Zhanpeng Zeng, Rongrong Ji

Grouped Speculative Decoding for Autoregressive Image Generation

Recently, autoregressive (AR) image models have demonstrated remarkable generative capabilities, positioning themselves as a compelling alternative to diffusion models. However, their sequential nature leads to long inference times,…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-12 · Junhyuk So, Juncheol Shin, Hyunho Kook, Eunhyeok Park

Speculative Coupled Decoding for Training-Free Lossless Acceleration of Autoregressive Visual Generation

Autoregressive (AR) modeling has recently emerged as a promising new paradigm in visual generation, but its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-06 · Junhyuk So, Hyunho Kook, Chaeyeon Jang, Eunhyeok Park

Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference

Speculative decoding accelerates LLM inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size elevates acceptance rates but introduces additional latency…

Computation and Language · Computer Science · 2025-12-15 · Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova

HIPPO: Accelerating Video Large Language Models Inference via Holistic-aware Parallel Speculative Decoding

Speculative decoding (SD) has emerged as a promising approach to accelerate LLM inference without sacrificing output quality. Existing SD methods tailored for video-LLMs primarily focus on pruning redundant visual tokens to mitigate the…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-14 · Qitan Lv, Tianyu Liu, Wen Wu, Xuenan Xu, Bowen Zhou, Feng Wu, Chao Zhang

Faster LLM Inference via Sequential Monte Carlo

Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the…

Machine Learning · Computer Science · 2026-04-20 · Yahya Emara, Mauricio Barba da Costa, Chi-Chih Chang, Cameron Freer, Tim Vieira, Ryan Cotterell, Mohamed S. Abdelfattah

Multi-Drafter Speculative Decoding with Alignment Feedback

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller model to draft future tokens, which are then verified by the target LLM. This preserves generation quality by accepting only aligned tokens.…

Computation and Language · Computer Science · 2026-04-08 · Taehyeon Kim, Hojung Jung, Se-Young Yun

Tutorial Proposal: Speculative Decoding for Efficient LLM Inference

This tutorial presents a comprehensive introduction to Speculative Decoding (SD), an advanced technique for LLM inference acceleration that has garnered significant research interest in recent years. SD is introduced as an innovative…

Computation and Language · Computer Science · 2025-03-04 · Heming Xia, Cunxiao Du, Yongqi Li, Qian Liu, Wenjie Li

MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less…

Machine Learning · Computer Science · 2026-02-17 · Zongle Huang, Lei Zhu, Zongyuan Zhan, Ting Hu, Weikai Mao, Xianzhi Yu, Yongpan Liu, Tianyu Zhang

AMUSD: Asynchronous Multi-Device Speculative Decoding for LLM Acceleration

Large language models typically generate tokens autoregressively, using each token as input for the next. Recent work on Speculative Decoding has sought to accelerate this process by employing a smaller, faster draft model to more quickly…

Computation and Language · Computer Science · 2024-10-24 · Bradley McDanel

Improving Multi-candidate Speculative Decoding

Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs) by using a lower complexity draft model to propose candidate tokens verified by a larger target model. To further improve efficiency,…

Computation and Language · Computer Science · 2024-12-17 · Xiaofan Lu, Yixiao Zeng, Feiyang Ma, Zixu Yu, Marco Levorato

MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification

Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification. While recent methods improve draft quality by tightly coupling the drafter with the target model, the…

Machine Learning · Computer Science · 2026-04-14 · Jingwei Song, Xinyu Wang, Hanbin Wang, Xiaoxuan Lei, Bill Shi, Shixin Han, Eric Yang, Xiao-Wen Chang, Lynn Ai

Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass.…

Computation and Language · Computer Science · 2025-06-12 · Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel

DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding

As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter…

Computation and Language · Computer Science · 2025-10-06 · Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, Jun Wang

Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding

LLMs have low GPU efficiency and high latency due to autoregressive decoding. Speculative decoding (SD) mitigates this using a small draft model to speculatively generate multiple tokens, which are then verified in parallel by a target…

Computation and Language · Computer Science · 2026-04-21 · Sungkyun Kim, Jaemin Kim, Dogyung Yoon, Jiho Shin, Junyeol Lee, Jiwon Seo

LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding

Auto-Regressive (AR) models have recently gained prominence in image generation, often matching or even surpassing the performance of diffusion models. However, one major limitation of AR models is their sequential nature, which processes…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-04 · Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sung-Yub Kim, Eunho Yang

SAM Decoding: Speculative Decoding via Suffix Automaton

Speculative decoding (SD) has been demonstrated as an effective technique for lossless LLM inference acceleration. Retrieval-based SD methods, one kind of model-free method, have yielded promising speedup, but they often rely on incomplete…

Computation and Language · Computer Science · 2024-12-17 · Yuxuan Hu, Ke Wang, Xiaokang Zhang, Fanjin Zhang, Cuiping Li, Hong Chen, Jing Zhang

DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising…

Computation and Language · Computer Science · 2025-03-04 · Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu

Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation

Autoregressive (AR) image generation models are capable of producing high-fidelity images but often suffer from slow inference due to their inherently sequential, token-by-token decoding process. Speculative decoding, which employs a…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-30 · Zhi-Kai Chen, Jun-Peng Jiang, Han-Jia Ye, De-Chuan Zhan