中文
相关论文

相关论文: DFlash: Block Diffusion for Flash Speculative Deco…

200 篇论文

Speculative decoding accelerates autoregressive language models by using a lightweight drafter to propose multiple future tokens, which the target model then verifies in parallel. DFlash shows that a block diffusion drafter can generate an…

计算与语言 · 计算机科学 2026-04-15 Liran Ringel , Yaniv Romano

Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a…

机器学习 · 计算机科学 2026-01-29 Rui Pan , Zhuofu Chen , Hongyi Liu , Arvind Krishnamurthy , Ravi Netravali

Large language models and large multimodal models (LLMs and LMMs) deliver strong generative performance but suffer from slow decoding, a problem that becomes more severe when handling visual inputs, whose sequences typically contain many…

计算机视觉与模式识别 · 计算机科学 2026-02-04 Zihua Wang , Ruibo Li , Haozhe Du , Joey Tianyi Zhou , Yu Zhang , Xu Yang

Diffusion-based Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive models, offering unique advantages through bidirectional attention and parallel generation paradigms. However, the generation results…

计算与语言 · 计算机科学 2025-10-07 Yifeng Gao , Ziang Ji , Yuxuan Wang , Biqing Qi , Hanlin Xu , Linfeng Zhang

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive…

计算与语言 · 计算机科学 2025-11-05 Jameson Sandler , Jacob K. Christopher , Thomas Hartvigsen , Ferdinando Fioretto

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

计算与语言 · 计算机科学 2024-11-28 Hyun Ryu , Eric Kim

Masked Diffusion Models (MDMs) offer a promising alternative to autoregressive language models by enabling parallel token generation and bidirectional context modeling. However, their inference speed is significantly limited by the…

机器学习 · 计算机科学 2026-04-08 Satyam Goyal , Kushal Patel , Tanush Mittal , Arjun Laxman

As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter…

计算与语言 · 计算机科学 2025-10-06 Guanghao Li , Zhihui Fu , Min Fang , Qibin Zhao , Ming Tang , Chun Yuan , Jun Wang

Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising…

计算与语言 · 计算机科学 2025-03-04 Kai Lv , Honglin Guo , Qipeng Guo , Xipeng Qiu

Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often…

机器学习 · 计算机科学 2026-01-15 Sudhanshu Agrawal , Risheek Garrepalli , Raghavv Goel , Mingu Lee , Christopher Lott , Fatih Porikli

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

计算与语言 · 计算机科学 2024-06-05 Heming Xia , Zhe Yang , Qingxiu Dong , Peiyi Wang , Yongqi Li , Tao Ge , Tianyu Liu , Wenjie Li , Zhifang Sui

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs for text generation, with the potential to decode multiple tokens in a single iteration. However, none of the existing open-source…

机器学习 · 计算机科学 2025-08-14 Xu Wang , Chenkai Xu , Yijie Jin , Jiachun Jin , Hao Zhang , Zhijie Deng

Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing…

计算与语言 · 计算机科学 2025-11-26 Luohe Shi , Zuchao Li , Lefei Zhang , Baoyuan Qi , Guoming Liu , Hai Zhao

Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference,…

计算与语言 · 计算机科学 2026-01-28 Fuliang Liu , Xue Li , Ketai Zhao , Yinxi Gao , Ziyan Zhou , Zhonghui Zhang , Zhibin Wang , Wanchun Dou , Sheng Zhong , Chen Tian

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

机器学习 · 计算机科学 2025-02-06 Minghao Yan , Saurabh Agarwal , Shivaram Venkataraman

Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate…

计算与语言 · 计算机科学 2025-05-14 Danying Ge , Jianhua Gao , Qizhi Jiang , Yifei Feng , Weixing Ji

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive language generation due to their potential for parallel decoding and global refinement of the entire sequence. To unlock this potential, DLM…

机器学习 · 计算机科学 2026-04-20 Xiang Xia , Wuyang Zhang , Jiazheng Liu , Cheng Yan , Yanyong Zhang

Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both…

计算与语言 · 计算机科学 2024-02-20 Nikhil Bhendawade , Irina Belousova , Qichen Fu , Henry Mason , Mohammad Rastegari , Mahyar Najibi

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass.…

计算与语言 · 计算机科学 2025-06-12 Nadav Timor , Jonathan Mamou , Daniel Korat , Moshe Berchansky , Gaurav Jain , Oren Pereg , Moshe Wasserblat , David Harel
‹ 上一页 1 2 3 10 下一页 ›