Related papers: FlowSpec: Continuous Pipelined Speculative Decodin…

200 papers

Speculative decoding accelerates large language model inference by using smaller draft models to generate candidate tokens for parallel verification. However, current approaches are limited by sequential stage dependencies that prevent full…

Artificial Intelligence · Computer Science 2025-05-06 Bradley McDanel, Sai Qian Zhang, Yunhai Hu, Zining Liu

Speculative decoding is an effective and lossless method for Large Language Model (LLM) inference acceleration. It employs a smaller model to generate a draft token sequence, which is then verified by the original base model. In multi-GPU…

Machine Learning · Computer Science 2025-12-09 Yize Wu, Ke Gao, Ling Li, Yanjun Wu

Deploying large language models (LLMs) in mobile and edge computing environments is constrained by limited on-device resources, scarce wireless bandwidth, and frequent model evolution. Although edge-cloud collaborative inference with…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-05 Yuchen Li, Rui Kong, Zhonghao Lyu, Qiyang Li, Xinran Chen, Hengyi Cai, Lingyong Yan, Shuaiqiang Wang, Jiashu Zhao, Guangxu Zhu, Linghe Kong, Guihai Chen, Haoyi Xiong, Dawei Yin

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

Recent advancements and widespread adoption of Large Language Models (LLMs) in both industry and academia have catalyzed significant demand for LLM serving. However, traditional cloud services incur high costs, while on-device inference…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-30 Yida Zhang, Zhiyong Gao, Shuaibing Yue, Jie Li, Rui Wang

Large language models (LLMs) have revolutionized natural language processing, yet their high computational demands pose significant challenges for real-time inference, especially in multi-user server speculative decoding and…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-16 Phuong Tran, Tzu-Hao Liu, Long Tan Le, Tung-Anh Nguyen, Van Quan La, Eason Yu, Han Shu, Choong Seon Hong, Nguyen H. Tran

Inference of Large Language Models (LLMs) across computer clusters has become a focal point of research in recent times, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce…

Computation and Language · Computer Science 2024-11-19 Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari

The demand for large language model inference is rapidly increasing. Pipeline parallelism offers a cost-effective deployment strategy for distributed inference but suffers from high service latency. While incorporating speculative decoding…

Machine Learning · Computer Science 2025-09-01 Haofei Yin, Mengbai Xiao, Tinghong Li, Xiao Zhang, Dongxiao Yu, Guanghui Zhang

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

Speculative decoding speeds up autoregressive generation in Large Language Models (LLMs) through a two-step procedure, where a lightweight draft model proposes tokens which the target model then verifies in a single forward pass. Although…

Machine Learning · Computer Science 2026-05-12 Anton Plaksin, Sergei Krutikov, Sergei Skvortsov, Alexander Samarin

Large language model (LLM) inference at the network edge is a promising serving paradigm that leverages distributed edge resources to run inference near users and enhance privacy. Existing edge-based LLM inference systems typically adopt…

Systems and Control · Electrical Eng. & Systems 2025-10-14 Bingjie Zhu, Zhixiong Chen, Liqiang Zhao, Hyundong Shin, Arumugam Nallanathan

While speculative decoding has recently appeared as a promising direction for accelerating the inference of large language models (LLMs), the speedup and scalability are strongly bounded by the token acceptance rate. Prevalent methods…

Machine Learning · Computer Science 2024-10-16 Yunfan Xiong, Ruoyu Zhang, Yanzeng Li, Tianhao Wu, Lei Zou

Low-latency decoding for large language models (LLMs) is crucial for applications like chatbots and code assistants, yet generating long outputs remains slow in single-query settings. Prior work on speculative decoding (which combines a…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-16 Ziyi Zhang, Ziheng Jiang, Chengquan Jiang, Menghan Yu, Size Zheng, Haibin Lin, Henry Hoffmann, Xin Liu

As large language models (LLMs) scale up, accuracy improves, but the autoregressive (AR) nature of decoding increases latency since each token requires a serial forward pass. Speculative decoding addresses this by employing a fast drafter…

Computation and Language · Computer Science 2025-10-06 Guanghao Li, Zhihui Fu, Min Fang, Qibin Zhao, Ming Tang, Chun Yuan, Jun Wang

Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario with multi-user inference, where draft…

Information Theory · Computer Science 2026-04-24 Yaodan Xu, Sheng Zhou, Zhisheng Niu

Efficient inference in large language models (LLMs) has become a critical focus as their scale and complexity grow. Traditional autoregressive decoding, while effective, suffers from computational inefficiencies due to its sequential token…

Computation and Language · Computer Science 2024-11-28 Hyun Ryu, Eric Kim

Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demands. A…

Artificial Intelligence · Computer Science 2026-05-05 Yuanyuan Jia, Shunpu Tang, Qianqian Yang

Speculative decoding can significantly accelerate LLM inference, especially given that its cloud-edge collaborative deployment offers cloud workload offloading, offline robustness, and privacy enhancement. However, existing collaborative…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-26 Yunhe Han, Yunqi Gao, Bing Hu, Mahdi Boloursaz Mashhadi, Yitong Duan, Pei Xiao, Yanfeng Zhang

As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this…

Computation and Language · Computer Science 2026-04-09 Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An

The growing gap between the increasing complexity of large language models (LLMs) and the limited computational budgets of edge devices poses a key challenge for efficient on-device inference, despite gradual improvements in hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-06 Xiangchen Li, Dimitrios Spatharakis, Saeid Ghafouri, Jiakun Fan, Hans Vandierendonck, Deepu John, Bo Ji, Dimitrios Nikolopoulos