Related papers: A Pipelined Collaborative Speculative Decoding Fra…

PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding

Speculative decoding can significantly accelerate LLM inference, especially given that its cloud-edge collaborative deployment offers cloud workload offloading, offline robustness, and privacy enhancement. However, existing collaborative…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-05-26 · Yunhe Han, Yunqi Gao, Bing Hu, Mahdi Boloursaz Mashhadi, Yitong Duan, Pei Xiao, Yanfeng Zhang

FlexSpec: Frozen Drafts Meet Evolving Targets in Edge-Cloud Collaborative LLM Speculative Decoding

Deploying large language models (LLMs) in mobile and edge computing environments is constrained by limited on-device resources, scarce wireless bandwidth, and frequent model evolution. Although edge-cloud collaborative inference with…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-01-05 · Yuchen Li, Rui Kong, Zhonghao Lyu, Qiyang Li, Xinran Chen, Hengyi Cai, Lingyong Yan, Shuaiqiang Wang, Jiashu Zhao, Guangxu Zhu, Linghe Kong, Guihai Chen, Haoyi Xiong, Dawei Yin

FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference

Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process to multiple devices to ensure that the LLMs can fit into the device…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-01-13 · Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang, Xu Chen

PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding

Speculative decoding accelerates large language model inference by using smaller draft models to generate candidate tokens for parallel verification. However, current approaches are limited by sequential stage dependencies that prevent full…

Artificial Intelligence · Computer Science · 2025-05-06 · Bradley McDanel, Sai Qian Zhang, Yunhai Hu, Zining Liu

Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding

Large language model (LLM) inference at the network edge is a promising serving paradigm that leverages distributed edge resources to run inference near users and enhance privacy. Existing edge-based LLM inference systems typically adopt…

Systems and Control · Electrical Eng. & Systems · 2025-10-14 · Bingjie Zhu, Zhixiong Chen, Liqiang Zhao, Hyundong Shin, Arumugam Nallanathan

Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits

Large Language Models (LLMs) enable various applications on edge devices such as smartphones, wearables, and embodied robots. However, their deployment often depends on expensive cloud-based APIs, creating high operational costs, which…

Robotics · Computer Science · 2025-05-29 · Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda

SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

The growing gap between the increasing complexity of large language models (LLMs) and the limited computational budgets of edge devices poses a key challenge for efficient on-device inference, despite gradual improvements in hardware…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-11-06 · Xiangchen Li, Dimitrios Spatharakis, Saeid Ghafouri, Jiakun Fan, Hans Vandierendonck, Deepu John, Bo Ji, Dimitrios Nikolopoulos

Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding

Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost because each output token is generated auto-regressively through all model layers. Early-exit based self-speculative decoding (EESD) has…

Computation and Language · Computer Science · 2025-09-25 · Ruanjun Li, Ziheng Liu, Yuanming Shi, Jiawei Shao, Chi Zhang, Xuelong Li

DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving

Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain…

Machine Learning · Computer Science · 2025-12-02 · Fengze Yu, Leshu Li, Brad McDanel, Sai Qian Zhang

DSSD: Efficient Edge-Device LLM Deployment and Collaborative Inference via Distributed Split Speculative Decoding

Large language models (LLMs) have transformed natural language processing but face critical deployment challenges in device-edge systems due to resource limitations and communication overhead. To address these issues, collaborative…

Signal Processing · Electrical Eng. & Systems · 2025-07-18 · Jiahong Ning, Ce Zheng, Tingting Yang

DiP-SD: Distributed Pipelined Speculative Decoding for Efficient LLM Inference at the Edge

Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario with multi-user inference, where draft…

Information Theory · Computer Science · 2026-04-24 · Yaodan Xu, Sheng Zhou, Zhisheng Niu

ConfigSpec: Profiling-Based Configuration Selection for Distributed Edge-Cloud Speculative LLM Serving

Speculative decoding enables collaborative Large Language Model (LLM) inference across cloud and edge by separating lightweight token drafting from heavyweight verification. While prior systems show performance and cost benefits, practical…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-04-14 · Xiangchen Li, Saeid Ghafouri, Jiakun Fan, Babar Ali, Hans Vandierendonck, Dimitrios S. Nikolopoulos

CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding

Vision-language models (VLMs) have demonstrated strong capabilities in multimodal perception and reasoning. However, deploying large VLMs on mobile devices remains challenging due to their substantial computational and memory demands. A…

Artificial Intelligence · Computer Science · 2026-05-05 · Yuanyuan Jia, Shunpu Tang, Qianqian Yang

PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation

Inference of Large Language Models (LLMs) across computer clusters has become a focal point of research in recent times, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce…

Computation and Language · Computer Science · 2024-11-19 · Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari

SpecEdge: Scalable Edge-Assisted Serving Framework for Interactive LLMs

Large language models (LLMs) power many modern applications, but serving them at scale remains costly and resource-intensive. Current server-centric systems overlook consumer-grade GPUs at the edge. We introduce SpecEdge, an edge-assisted…

Computation and Language · Computer Science · 2025-11-19 · Jinwoo Park, Seunggeun Cho, Dongsu Han

SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding

Low-latency decoding for large language models (LLMs) is crucial for applications like chatbots and code assistants, yet generating long outputs remains slow in single-query settings. Prior work on speculative decoding (which combines a…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-06-16 · Ziyi Zhang, Ziheng Jiang, Chengquan Jiang, Menghan Yu, Size Zheng, Haibin Lin, Henry Hoffmann, Xin Liu

Quantize-Sample-and-Verify: LLM Acceleration via Adaptive Edge-Cloud Speculative Decoding

In edge-cloud speculative decoding (SD), edge devices equipped with small language models (SLMs) generate draft tokens that are verified by large language models (LLMs) in the cloud. A key bottleneck in such systems is the limited…

Signal Processing · Electrical Eng. & Systems · 2026-01-13 · Guangyi Zhang, Yunlong Cai, Guanding Yu, Petar Popovski, Osvaldo Simeone

AdaSpec: Adaptive Speculative Decoding for Fast, SLO-Aware Large Language Model Serving

Cloud-based Large Language Model (LLM) services often face challenges in achieving low inference latency and meeting Service Level Objectives (SLOs) under dynamic request patterns. Speculative decoding, which exploits lightweight models for…

Computation and Language · Computer Science · 2026-01-13 · Kaiyu Huang, Hao Wu, Zhubo Shi, Han Zou, Minchen Yu, Qingjiang Shi

Collaborative Inference and Learning between Edge SLMs and Cloud LLMs: A Survey of Algorithms, Execution, and Open Challenges

As large language models (LLMs) evolve, deploying them solely in the cloud or compressing them for edge devices has become inadequate due to concerns about latency, privacy, cost, and personalization. This survey explores a collaborative…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-07-23 · Senyao Li, Haozhao Wang, Wenchao Xu, Rui Zhang, Song Guo, Jingling Yuan, Xian Zhong, Tianwei Zhang, Ruixuan Li

Progressive Semantic Communication for Efficient Edge-Cloud Vision-Language Models

Deploying Vision-Language Models (VLMs) on edge devices remains challenging due to their substantial computational and memory demands, which exceed the capabilities of resource-constrained embedded platforms. Conversely, fully offloading…

Machine Learning · Computer Science · 2026-04-30 · Cyril Shih-Huan Hsu, Wig Yuan-Cheng Cheng, Chrysa Papagianni