Related papers: SPEED: Speculative Pipelined Execution for Efficie…

200 papers

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

With the increasingly giant scales of (causal) large language models (LLMs), the inference efficiency comes as one of the core concerns along the improved performance. In contrast to the memory footprint, the latency bottleneck seems to be…

Computation and Language · Computer Science 2024-04-24 Chen Zhang, Zhuorui Liu, Dawei Song

Large Language Models (LLMs) demonstrate remarkable emergent abilities across various tasks, yet fall short of complex reasoning and planning tasks. The tree-search-based reasoning methods address this by surpassing the capabilities of…

Computation and Language · Computer Science 2024-12-18 Zhenglin Wang, Jialong Wu, Yilong Lai, Congzhi Zhang, Deyu Zhou

This technical report describes the design and training of novel speculative decoding draft models, for accelerating the inference speeds of large language models in a production environment. By conditioning draft predictions on both…

Computation and Language · Computer Science 2024-06-10 Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa

The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, as tokens are generated sequentially. While speculative and parallel decoding techniques attempt to mitigate this, they face…

Artificial Intelligence · Computer Science 2024-10-22 Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli

Large language model (LLM) inference at the network edge is a promising serving paradigm that leverages distributed edge resources to run inference near users and enhance privacy. Existing edge-based LLM inference systems typically adopt…

Systems and Control · Electrical Eng. & Systems 2025-10-14 Bingjie Zhu, Zhixiong Chen, Liqiang Zhao, Hyundong Shin, Arumugam Nallanathan

Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process to multiple devices to ensure that the LLMs can fit into the device…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-13 Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang, Xu Chen

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

Recent advances with large language models (LLM) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low…

Artificial Intelligence · Computer Science 2023-08-10 Benjamin Spector, Chris Re

Inference of Large Language Models (LLMs) across computer clusters has become a focal point of research in recent times, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce…

Computation and Language · Computer Science 2024-11-19 Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari

Large language models (LLMs) have revolutionized natural language processing, yet their high computational demands pose significant challenges for real-time inference, especially in multi-user server speculative decoding and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Phuong Tran, Tzu-Hao Liu, Long Tan Le, Tung-Anh Nguyen, Van Quan La, Eason Yu, Han Shu, Choong Seon Hong, Nguyen H. Tran

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science 2024-04-19 Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

Federated inference enhances LLM performance in edge computing through weighted averaging of distributed model predictions. However, autoregressive LLM inference requires frequent full-model forward passes across workers, severely limiting…

Signal Processing · Electrical Eng. & Systems 2026-04-29 Ce Zheng, Xinghan Wang, Jiahong Ning, Yuxuan Shi, Ning Huang, Tingting Yang

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science 2024-11-28 Hyun Ryu, Eric Kim

Speculative decoding has been shown as an effective way to accelerate Large Language Model (LLM) inference by using a Small Speculative Model (SSM) to generate candidate tokens in a so-called speculation phase, which are subsequently…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-21 Fahao Chen, Peng Li, Tom H. Luan, Zhou Su, Jing Deng

This tutorial presents a comprehensive introduction to Speculative Decoding (SD), an advanced technique for LLM inference acceleration that has garnered significant research interest in recent years. SD is introduced as an innovative…

Computation and Language · Computer Science 2025-03-04 Heming Xia, Cunxiao Du, Yongqi Li, Qian Liu, Wenjie Li

Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any…

Machine Learning · Computer Science 2023-05-22 Yaniv Leviathan, Matan Kalman, Yossi Matias

Accelerating the inference of large language models (LLMs) is a critical challenge in generative AI. Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass.…

Computation and Language · Computer Science 2025-06-12 Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Gaurav Jain, Oren Pereg, Moshe Wasserblat, David Harel

Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in…

Computation and Language · Computer Science 2024-11-12 Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun

Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of…

Computation and Language · Computer Science 2026-02-05 Ximing Dong, Shaowei Wang, Dayi Lin, Boyuan Chen, Ahmed E. Hassan