Related papers: DiffuSpec: Unlocking Diffusion Language Models for…

DEER: Draft with Diffusion, Verify with Autoregressive Models

Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify…

Machine Learning · Computer Science 2025-12-18 Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu

Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs

Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a…

Machine Learning · Computer Science 2026-01-29 Rui Pan, Zhuofu Chen, Hongyi Liu, Arvind Krishnamurthy, Ravi Netravali

DFlash: Block Diffusion for Flash Speculative Decoding

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast…

Computation and Language · Computer Science 2026-05-29 Jian Chen, Yesheng Liang, Zhijian Liu

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

AdaSpec: Adaptive Speculative Decoding for Fast, SLO-Aware Large Language Model Serving

Cloud-based Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for…

Computation and Language · Computer Science 2026-01-13 Kaiyu Huang, Hao Wu, Zhubo Shi, Han Zou, Minchen Yu, Qingjiang Shi

SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding

Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive…

Computation and Language · Computer Science 2025-11-05 Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, Ferdinando Fioretto

DiP-SD: Distributed Pipelined Speculative Decoding for Efficient LLM Inference at the Edge

Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario with multi-user inference, where draft…

Information Theory · Computer Science 2026-04-24 Yaodan Xu, Sheng Zhou, Zhisheng Niu

DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising…

Computation and Language · Computer Science 2025-03-04 Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu

DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Speculative decoding accelerates LLM inference by letting a small drafter propose multiple tokens which a large target model verifies once per speculation step. As vocabularies scale past 10e5 tokens, verification cost in the target model is…

Computation and Language · Computer Science 2026-02-04 Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar

EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization

Speculative decoding is an effective and lossless method for Large Language Model (LLM) inference acceleration. It employs a smaller model to generate a draft token sequence, which is then verified by the original base model. In multi-GPU…

Machine Learning · Computer Science 2025-12-09 Yize Wu, Ke Gao, Ling Li, Yanjun Wu

DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference

Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference,…

Computation and Language · Computer Science 2026-01-28 Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian

Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science 2025-02-12 Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto

DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure

While speculative decoding has recently appeared as a promising direction for accelerating the inference of large language models (LLMs), the speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods…

Machine Learning · Computer Science 2024-10-16 Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou

Self Speculative Decoding for Diffusion Large Language Models

Diffusion-based Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive models, offering unique advantages through bidirectional attention and parallel generation paradigms. However, the generation results…

Computation and Language · Computer Science 2025-10-07 Yifeng Gao, Ziang Ji, Yuxuan Wang, Biqing Qi, Hanlin Xu, Linfeng Zhang

Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding

The immense model sizes of large language models (LLMs) challenge deployment on memory-limited consumer GPUs. Although model compression and parameter offloading are common strategies to address memory limitations, compression can degrade…

Computation and Language · Computer Science 2025-10-10 Pei-Shuo Wang, Jian-Jia Chen, Chun-Che Yang, Chi-Chih Chang, Ning-Chi Huang, Mohamed S. Abdelfattah, Kai-Chiang Wu

FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference

Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process to multiple devices to ensure that the LLMs can fit into the device…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-13 Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang, Xu Chen

Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models

Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks, including document processing and code generation. Autoregressive Language Models (ARMs), which generate…

Machine Learning · Computer Science 2025-12-16 Minseo Kim, Coleman Hooper, Aditya Tomar, Chenfeng Xu, Mehrdad Farajtabar, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation

Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to…

Computation and Language · Computer Science 2026-04-30 Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun, Xiaoyan Sun

SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft…

Computation and Language · Computer Science 2026-04-15 Zhuofan Wen, Yang Feng

DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving

Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain…

Machine Learning · Computer Science 2025-12-02 Fengze Yu, Leshu Li, Brad McDanel, Sai Qian Zhang