We propose Speculative Decoding (SpecDec) to formally study, for the first time, the idea of exploiting speculative execution to accelerate autoregressive (AR) decoding. Speculative Decoding has two innovations: Spec-Drafter, an independent model specially optimized for efficient and accurate drafting, and Spec-Verification, a reliable method for efficiently verifying the drafted tokens in the decoding paradigm. Experimental results on various seq2seq tasks, including machine translation and abstractive summarization, show that our approach achieves around 5× speedup for popular Transformer architectures with generation quality comparable to beam search decoding, overturning the impression that the draft-then-verify paradigm yields only a 1.4× to 2× speedup. Beyond this speedup, we demonstrate three further advantages of SpecDec, revealing its practical value for accelerating generative models in real-world applications. Our models and code are available at https://github.com/hemingkx/SpecDec.
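The draft-then-verify paradigm mentioned above can be illustrated with a minimal sketch. This is not the paper's Spec-Drafter or Spec-Verification: it assumes a strict verification rule (a drafted token is kept only if it equals the target model's greedy choice), and the toy models and all names (`toy_target`, `toy_drafter`, `draft_then_verify`, `k`) are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def toy_target(prefix: List[Token]) -> Token:
    # Hypothetical stand-in for the slow AR model's greedy next-token choice.
    return (prefix[-1] * 3 + 1) % 11


def toy_drafter(prefix: List[Token]) -> Token:
    # Hypothetical cheap drafter: agrees with the target except at some positions.
    guess = toy_target(prefix)
    return (guess + 1) % 11 if len(prefix) % 5 == 0 else guess


def greedy_decode(target: NextToken, prefix: List[Token], max_new: int) -> List[Token]:
    # Baseline: one target call per generated token.
    out = list(prefix)
    for _ in range(max_new):
        out.append(target(out))
    return out


def draft_then_verify(
    target: NextToken,
    drafter: NextToken,
    prefix: List[Token],
    max_new: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    out = list(prefix)
    limit = len(prefix) + max_new
    rounds = 0  # number of draft/verify iterations
    while len(out) < limit:
        # Draft: the cheap model proposes up to k tokens.
        draft: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            draft.append(drafter(out + draft))
        # Verify: compare each drafted token with the target's greedy token.
        # A Transformer could score all drafted positions in one parallel pass;
        # the loop here only emulates that check position by position.
        rounds += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # correct the first mismatch, drop the rest
                break
            accepted.append(tok)
        out.extend(accepted)
    return out, rounds


tokens, rounds = draft_then_verify(toy_target, toy_drafter, [1], max_new=12, k=4)
print(tokens, rounds)
```

Under this strict rule the output is identical to plain greedy decoding with the target model, and any speedup comes from needing fewer verification rounds than generated tokens. The abstract compares SpecDec's quality with beam search, so the paper's actual verification criterion should be taken from the paper itself, not from this sketch.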
@article{arxiv.2203.16487,
title = {Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation},
author = {Heming Xia and Tao Ge and Peiyi Wang and Si-Qing Chen and Furu Wei and Zhifang Sui},
journal = {arXiv preprint arXiv:2203.16487},
year = {2023}
}
Comments
$\textbf{v1-v4}$ (Early 2022): Initially announced under the name "Generalized Aggressive Decoding". $\textbf{v5}$ (September 2022): Renamed to "Speculative Decoding" for the ICLR'23 submission (https://openreview.net/pdf?id=H-VlwsYvVi), marking $\textbf{the first time}$ "Speculative Decoding" was publicly proposed. $\textbf{v6}$: EMNLP'23 Findings camera-ready version.