Related papers

Related papers: DiffuSpec: Unlocking Diffusion Language Models for…

200 papers

Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify…

Machine Learning · Computer Science 2025-12-18 Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu

Diffusion Large Language Models (dLLMs) offer fast, parallel token generation, but their standalone use is plagued by an inherent efficiency-quality tradeoff. We show that, if carefully applied, the attributes of dLLMs can actually be a…

Machine Learning · Computer Science 2026-01-29 Rui Pan, Zhuofu Chen, Hongyi Liu, Arvind Krishnamurthy, Ravi Netravali

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast…

Computation and Language · Computer Science 2026-05-29 Jian Chen, Yesheng Liang, Zhijian Liu

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

Cloud-based Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for…

Computation and Language · Computer Science 2026-01-13 Kaiyu Huang, Hao Wu, Zhubo Shi, Han Zou, Minchen Yu, Qingjiang Shi

Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive…

Computation and Language · Computer Science 2025-11-05 Jameson Sandler, Jacob K. Christopher, Thomas Hartvigsen, Ferdinando Fioretto

Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario with multi-user inference, where draft…

Information Theory · Computer Science 2026-04-24 Yaodan Xu, Sheng Zhou, Zhisheng Niu

Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising…

Computation and Language · Computer Science 2025-03-04 Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu

Speculative decoding accelerates LLM inference by letting a small drafter propose multiple tokens which a large target model verifies once per speculation step. As vocabularies scale past 10e5 tokens, verification cost in the target model is…

Computation and Language · Computer Science 2026-02-04 Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar

Speculative decoding is an effective and lossless method for Large Language Model (LLM) inference acceleration. It employs a smaller model to generate a draft token sequence, which is then verified by the original base model. In multi-GPU…

Machine Learning · Computer Science 2025-12-09 Yize Wu, Ke Gao, Ling Li, Yanjun Wu

Speculative decoding is an effective and lossless approach for accelerating LLM inference. However, existing widely adopted model-based draft designs, such as EAGLE3, improve accuracy at the cost of multi-step autoregressive inference,…

Computation and Language · Computer Science 2026-01-28 Fuliang Liu, Xue Li, Ketai Zhao, Yinxi Gao, Ziyan Zhou, Zhonghui Zhang, Zhibin Wang, Wanchun Dou, Sheng Zhong, Chen Tian

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

While speculative decoding has recently appeared as a promising direction for accelerating the inference of large language models (LLMs), the speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods…

Machine Learning · Computer Science 2024-10-16 Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou

Diffusion-based Large Language Models (dLLMs) have emerged as a competitive alternative to autoregressive models, offering unique advantages through bidirectional attention and parallel generation paradigms. However, the generation results…

Computation and Language · Computer Science 2025-10-07 Yifeng Gao, Ziang Ji, Yuxuan Wang, Biqing Qi, Hanlin Xu, Linfeng Zhang

The immense model sizes of large language models (LLMs) challenge deployment on memory-limited consumer GPUs. Although model compression and parameter offloading are common strategies to address memory limitations, compression can degrade…

Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process to multiple devices to ensure that the LLMs can fit into the device…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-13 Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang, Xu Chen

Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks, including document processing and code generation. Autoregressive Language Models (ARMs), which generate…

Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to…

Computation and Language · Computer Science 2026-04-30 Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun, Xiaoyan Sun

Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft…

Computation and Language · Computer Science 2026-04-15 Zhuofan Wen, Yang Feng

Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain…

Machine Learning · Computer Science 2025-12-02 Fengze Yu, Leshu Li, Brad McDanel, Sai Qian Zhang