Related papers: Speculative Decoding with Big Little Decoder

Fast Inference via Hierarchical Speculative Decoding

Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft…

Machine Learning · Computer Science 2025-10-24 Clara Mohri , Haim Kaplan , Tal Schuster , Yishay Mansour , Amir Globerson

Efficient LLM Inference over Heterogeneous Edge Networks with Speculative Decoding

Large language model (LLM) inference at the network edge is a promising serving paradigm that leverages distributed edge resources to run inference near users and enhance privacy. Existing edge-based LLM inference systems typically adopt…

Systems and Control · Electrical Eng. & Systems 2025-10-14 Bingjie Zhu , Zhixiong Chen , Liqiang Zhao , Hyundong Shin , Arumugam Nallanathan

Tandem Transformers for Inference Efficient LLMs

The autoregressive nature of conventional large language models (LLMs) inherently limits inference speed, as tokens are generated sequentially. While speculative and parallel decoding techniques attempt to mitigate this, they face…

Artificial Intelligence · Computer Science 2024-10-22 Aishwarya P S , Pranav Ajit Nair , Yashas Samaga , Toby Boyd , Sanjiv Kumar , Prateek Jain , Praneeth Netrapalli

3-Model Speculative Decoding

Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a…

Computation and Language · Computer Science 2025-10-16 Sanghyun Byun , Mohanad Odema , Jung Ick Guack , Baisub Lee , Jacob Song , Woo Seong Chung

Self-Speculative Biased Decoding for Faster Re-Translation

Large language models achieve strong machine translation quality but incur high inference cost and latency, posing challenges for simultaneous translation. Re-translation provides a practical solution for off-the-shelf LLMs by repeatedly…

Computation and Language · Computer Science 2026-01-06 Linxiao Zeng , Haoyun Deng , Kangyuan Shu , Shizhen Wang

SPEED: Speculative Pipelined Execution for Efficient Decoding

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper , Sehoon Kim , Hiva Mohammadzadeh , Hasan Genc , Kurt Keutzer , Amir Gholami , Sophia Shao

AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft…

Computation and Language · Computer Science 2026-05-27 Kuan-Wei Lu , Ding-Yong Hong , Pangfeng Liu , Jan-Jan Wu

DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving

Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain…

Machine Learning · Computer Science 2025-12-02 Fengze Yu , Leshu Li , Brad McDanel , Sai Qian Zhang

Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput

Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens that are later verified by a stronger target model. While effective in centralized systems, its behavior in…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-18 Jingwei Song , Wanyi Chen , Xinyuan Song , Max , Chris Tong , Gufeng Chen , Tianyi Zhao , Eric Yang , Bill Shi , Lynn Ai

SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

The growing gap between the increasing complexity of large language models (LLMs) and the limited computational budgets of edge devices poses a key challenge for efficient on-device inference, despite gradual improvements in hardware…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-06 Xiangchen Li , Dimitrios Spatharakis , Saeid Ghafouri , Jiakun Fan , Hans Vandierendonck , Deepu John , Bo Ji , Dimitrios Nikolopoulos

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Large language models (LLMs) have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a…

Artificial Intelligence · Computer Science 2025-03-17 Zongyue Qin , Zifan He , Neha Prakriya , Jason Cong , Yizhou Sun

Fast Byte Latent Transformer

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the…

Computation and Language · Computer Science 2026-05-11 Julie Kallini , Artidoro Pagnoni , Tomasz Limisiewicz , Gargi Ghosh , Luke Zettlemoyer , Christopher Potts , Xiaochuang Han , Srinivasan Iyer

An Interpretable Latency Model for Speculative Decoding in LLM Serving

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups…

Machine Learning · Computer Science 2026-05-15 Linghao Kong , Megan Flynn , Michael Peng , Nir Shavit , Mark Kurtz , Alexandre Marques

Towards Resource Efficient and Interpretable Bias Mitigation in Large Language Models

Although large language models (LLMs) have demonstrated their effectiveness in a wide range of applications, they have also been observed to perpetuate unwanted biases present in the training data, potentially leading to harm for…

Computation and Language · Computer Science 2026-03-09 Schrasing Tong , Eliott Zemour , Jessica Lu , Rawisara Lohanimit , Lalana Kagal

Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios

Speculative decoding accelerates LLM inference by utilizing otherwise idle computational resources during memory-to-chip data transfer. Current speculative decoding methods typically assume a considerable amount of available computing…

Computation and Language · Computer Science 2025-11-26 Luohe Shi , Zuchao Li , Lefei Zhang , Baoyuan Qi , Guoming Liu , Hai Zhao

Accelerating LLM Inference with Staged Speculative Decoding

Recent advances with large language models (LLM) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low…

Artificial Intelligence · Computer Science 2023-08-10 Benjamin Spector , Chris Re

SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding

Large Language Models (LLMs) demonstrate remarkable emergent abilities across various tasks, yet fall short of complex reasoning and planning tasks. The tree-search-based reasoning methods address this by surpassing the capabilities of…

Computation and Language · Computer Science 2024-12-18 Zhenglin Wang , Jialong Wu , Yilong Lai , Congzhi Zhang , Deyu Zhou

Reward-Guided Speculative Decoding for Efficient LLM Reasoning

We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target…

Computation and Language · Computer Science 2025-06-27 Baohao Liao , Yuhui Xu , Hanze Dong , Junnan Li , Christof Monz , Silvio Savarese , Doyen Sahoo , Caiming Xiong

Multi-Drafter Speculative Decoding with Alignment Feedback

Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller model to draft future tokens, which are then verified by the target LLM. This preserves generation quality by accepting only aligned tokens.…

Computation and Language · Computer Science 2026-04-08 Taehyeon Kim , Hojung Jung , Se-Young Yun

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science 2024-06-05 Heming Xia , Zhe Yang , Qingxiu Dong , Peiyi Wang , Yongqi Li , Tao Ge , Tianyu Liu , Wenjie Li , Zhifang Sui