Related papers: EasySpec: Layer-Parallel Speculative Decoding for …

200 papers

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science · 2024-10-10 · Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process to multiple devices to ensure that the LLMs can fit into the device…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-01-13 · Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang, Xu Chen

The immense model sizes of large language models (LLMs) challenge deployment on memory-limited consumer GPUs. Although model compression and parameter offloading are common strategies to address memory limitations, compression can degrade…

Computation and Language · Computer Science · 2025-10-10 · Pei-Shuo Wang, Jian-Jia Chen, Chun-Che Yang, Chi-Chih Chang, Ning-Chi Huang, Mohamed S. Abdelfattah, Kai-Chiang Wu

Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although…

Machine Learning · Computer Science · 2026-05-12 · Anton Plaksin, Sergei Krutikov, Sergei Skvortsov, Alexander Samarin

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g. verification is $4\times$ slower than token generation when a…

Computation and Language · Computer Science · 2026-05-27 · Avinash Kumar, Sujay Sanghavi, Poulami Das

Low-latency decoding for large language models (LLMs) is crucial for applications like chatbots and code assistants, yet generating long outputs remains slow in single-query settings. Prior work on speculative decoding (which combines a…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-06-16 · Ziyi Zhang, Ziheng Jiang, Chengquan Jiang, Menghan Yu, Size Zheng, Haibin Lin, Henry Hoffmann, Xin Liu

Speculative decoding (SD), where a small draft model is employed to propose draft tokens in advance and then the target model validates them in parallel, has emerged as a promising technique for LLM inference acceleration. Many endeavors to…

Computation and Language · Computer Science · 2026-04-30 · Tianyu Liu, Qitan Lv, Hao Li, Xing Gao, Xiao Sun, Xiaoyan Sun

Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens…

Machine Learning · Computer Science · 2025-02-06 · Minghao Yan, Saurabh Agarwal, Shivaram Venkataraman

As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter…

Computation and Language · Computer Science · 2025-10-06 · Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, Jun Wang

Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising…

Computation and Language · Computer Science · 2025-03-04 · Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast…

Computation and Language · Computer Science · 2026-05-29 · Jian Chen, Yesheng Liang, Zhijian Liu

Speculative decoding (SD) is a promising method for accelerating the decoding process of Large Language Models (LLMs). The efficiency of SD primarily hinges on the consistency between the draft model and the verify model. However, existing…

Computation and Language · Computer Science · 2025-06-02 · Longze Chen, Renke Shan, Huiming Wang, Lu Wang, Ziqiang Liu, Run Luo, Jiawei Wang, Hamid Alinejad-Rokny, Min Yang

Speculative decoding is a prominent technique to speed up the inference of a large target language model based on predictions of an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both…

Computation and Language · Computer Science · 2024-02-20 · Nikhil Bhendawade, Irina Belousova, Qichen Fu, Henry Mason, Mohammad Rastegari, Mahyar Najibi

Speculative decoding has emerged as a widely adopted method to accelerate large language model inference without sacrificing the quality of the model outputs. While this technique has facilitated notable speed improvements by enabling…

Computation and Language · Computer Science · 2025-02-12 · Jacob K Christopher, Brian R Bartoldson, Tal Ben-Nun, Michael Cardei, Bhavya Kailkhura, Ferdinando Fioretto

Speculative decoding (SD) has emerged as an effective technique to accelerate large language model (LLM) inference without compromising output quality. However, the achievable speedup largely depends on the effectiveness of the drafting…

Computation and Language · Computer Science · 2025-11-04 · Min Fang, Zhihui Fu, Qibin Zhao, Jun Wang

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science · 2024-11-28 · Hyun Ryu, Eric Kim

Large language models (LLMs) have revolutionized natural language processing, yet their high computational demands pose significant challenges for real-time inference, especially in multi-user server speculative decoding and…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-12-16 · Phuong Tran, Tzu-Hao Liu, Long Tan Le, Tung-Anh Nguyen, Van Quan La, Eason Yu, Han Shu, Choong Seon Hong, Nguyen H. Tran

While speculative decoding has recently appeared as a promising direction for accelerating the inference of large language models (LLMs), the speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods…

Machine Learning · Computer Science · 2024-10-16 · Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou

Deploying large language models (LLMs) in mobile and edge computing environments is constrained by limited on-device resources, scarce wireless bandwidth, and frequent model evolution. Although edge-cloud collaborative inference with…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-01-05 · Yuchen Li, Rui Kong, Zhonghao Lyu, Qiyang Li, Xinran Chen, Hengyi Cai, Lingyong Yan, Shuaiqiang Wang, Jiashu Zhao, Guangxu Zhu, Linghe Kong, Guihai Chen, Haoyi Xiong, Dawei Yin

Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in…

Machine Learning · Computer Science · 2026-02-25 · Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han