Accelerating LLM Inference with Staged Speculative Decoding

Artificial Intelligence 2023-08-10 v1 Computation and Language

Abstract

Recent advances with large language models (LLMs) illustrate their diverse capabilities. We propose a novel algorithm, staged speculative decoding, to accelerate LLM inference in small-batch, on-device scenarios. We address the low arithmetic intensity of small-batch inference by improving upon previous work in speculative decoding. First, we restructure the speculative batch as a tree, which reduces generation costs and increases the expected tokens per batch. Second, we add a second stage of speculative decoding. Taken together, these changes reduce single-batch decoding latency by 3.16x with a 762M-parameter GPT-2-L model while perfectly preserving output quality.
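
For readers unfamiliar with the baseline technique, the following is a minimal sketch of plain (single-stage, non-tree) speculative decoding under greedy sampling. It is not the staged or tree-structured algorithm proposed in the paper. The names `target_model`, `draft_model`, and `speculative_decode` are hypothetical, and both "models" are toy deterministic functions standing in for a large and a small LLM. The sketch only illustrates why verification by the large model can preserve output exactly while reducing the number of large-model batches.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]

def target_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large model: a toy deterministic rule.
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the small model: agrees with the target
    # except when the prefix length is 3 mod 4, where it guesses wrong.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 != 3 else (guess + 1) % 50

def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Ordinary autoregressive decoding: one model call per new token.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_batches = 0
    while len(seq) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target scores all k + 1 positions. In a real system this
        #    would be a single batched forward pass; here it is a loop.
        target_batches += 1
        checks = [target(seq + proposal[:i]) for i in range(k + 1)]
        # 3. Accept the longest prefix of the proposal that the target agrees with.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == checks[n_ok]:
            n_ok += 1
        seq.extend(proposal[:n_ok])
        # 4. Append the target's own token at the first unverified position,
        #    so every batch yields at least one token.
        seq.append(checks[n_ok])
    return seq[:goal], target_batches

prompt = [1, 2, 3]
spec_out, batches = speculative_decode(target_model, draft_model, prompt, n_new=20)
baseline = greedy_decode(target_model, prompt, n_new=20)
print(spec_out == baseline, batches)
```

In this toy setup the speculative output matches target-only greedy decoding token for token, and the target is consulted in fewer batches than there are generated tokens. The paper's contributions (arranging the speculative batch as a tree and adding a second speculative stage to the draft model) build on this basic loop and are not reproduced here.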

Cite

@article{arxiv.2308.04623,
  title  = {Accelerating LLM Inference with Staged Speculative Decoding},
  author = {Benjamin Spector and Chris Re},
  journal= {arXiv preprint arXiv:2308.04623},
  year   = {2023}
}

Comments

Published at ES-FOMO at ICML 2023