Fast Collaborative Inference via Distributed Speculative Decoding

Ce Zheng; Ke Zhang; Chen Sun; Wenqi Zhang; Qiong Liu; Angesom Ataklity Tesfay

Signal Processing 2026-01-13 v2

Abstract

Speculative decoding accelerates large language model (LLM) inference by allowing a small draft model to predict multiple future tokens for verification by a larger target model. In AI-native radio access networks (AI-RAN), this enables device-edge collaborative inference but introduces significant uplink overhead, as existing distributed speculative decoding schemes transmit full-vocabulary logits at every step. We propose a sparsify-then-sample strategy, Truncated Sparse Logits Transmission (TSLT), which transmits only the logits and indices of a truncated candidate set. We provide theoretical guarantees showing that the acceptance rate is preserved under TSLT. TSLT is further extended to the multi-candidate case, where multiple draft candidates per step increase the acceptance probability. Experiments show that TSLT significantly reduces uplink communication while maintaining end-to-end inference latency and model quality, demonstrating its effectiveness for scalable, communication-efficient distributed LLM inference in future AI-RAN systems.
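
The sparsify-then-sample idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the truncated candidate set is chosen, so top-k truncation is assumed here, and the function names (`sparsify_then_sample`, `rebuild_draft_probs`, `accept_probability`) are hypothetical. The acceptance rule shown is the standard speculative-sampling ratio min(1, p(x)/q(x)); the sketch does not reproduce the paper's theoretical guarantees or its multi-candidate extension.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparsify_then_sample(draft_logits, k, rng):
    """Device side: truncate first, then sample from the truncated distribution.

    Assumption: the candidate set is the k largest logits (top-k).
    Returns the sampled token plus the sparse payload (indices, logits)
    that would be sent uplink instead of the full-vocabulary logits.
    """
    order = sorted(range(len(draft_logits)),
                   key=lambda i: draft_logits[i], reverse=True)
    indices = order[:k]
    kept_logits = [draft_logits[i] for i in indices]
    probs = softmax(kept_logits)  # renormalized over the candidate set only
    token = rng.choices(indices, weights=probs, k=1)[0]
    return token, indices, kept_logits


def rebuild_draft_probs(indices, kept_logits, vocab_size):
    """Edge side: rebuild the sparse draft distribution q from the payload."""
    q = [0.0] * vocab_size
    for i, p in zip(indices, softmax(kept_logits)):
        q[i] = p
    return q


def accept_probability(token, q, target_probs):
    """Standard speculative-sampling acceptance probability min(1, p/q)."""
    return min(1.0, target_probs[token] / q[token])


# Toy example with a 6-token vocabulary (all values are illustrative).
rng = random.Random(0)
draft_logits = [2.0, 0.5, 1.5, -1.0, 0.0, -2.0]
target_probs = softmax([1.8, 0.7, 1.2, -0.5, 0.1, -1.5])

token, indices, kept = sparsify_then_sample(draft_logits, k=3, rng=rng)
q = rebuild_draft_probs(indices, kept, len(draft_logits))
alpha = accept_probability(token, q, target_probs)
```

In this toy setting the uplink payload shrinks from 6 logits to 3 (index, logit) pairs. Because the token is sampled after truncation, it always lies inside the transmitted candidate set, so the edge side can evaluate q at that token from the payload alone.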

Keywords

speech recognition and language modeling; deep learning for wireless communications; deep learning for detection and classification

Cite

@article{arxiv.2512.16273,
  title  = {Fast Collaborative Inference via Distributed Speculative Decoding},
  author = {Ce Zheng and Ke Zhang and Chen Sun and Wenqi Zhang and Qiong Liu and Angesom Ataklity Tesfay},
  journal= {arXiv preprint arXiv:2512.16273},
  year   = {2026}
}
