Draft-based Approximate Inference for LLMs

Computation and Language 2026-02-03 v3 Artificial Intelligence

Abstract

Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt tokens; and (iii) SpecKV-PC, a cascaded compression strategy combining both techniques. Extensive experiments on long-context benchmarks demonstrate that our methods consistently achieve higher accuracy than existing baselines while retaining the same efficiency gains in memory usage, latency, and throughput.
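
The general idea of ranking prompt tokens by the attention they receive from a draft model, then keeping only the top-ranked ones, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy attention matrix, and the choice of max-aggregation over lookahead queries are all assumptions made for illustration, and the paper's actual scoring and selection rules may differ.

```python
from typing import List


def token_importance(attn: List[List[float]]) -> List[float]:
    """Score each prompt token by the largest attention weight it receives
    from any lookahead query (max-aggregation is an assumption here)."""
    n_tokens = len(attn[0])
    return [max(row[j] for row in attn) for j in range(n_tokens)]


def select_top_k(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring tokens in original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical attention weights: 2 draft lookahead queries x 5 prompt tokens.
draft_attn = [
    [0.50, 0.05, 0.30, 0.05, 0.10],
    [0.10, 0.05, 0.15, 0.60, 0.10],
]

scores = token_importance(draft_attn)
kept = select_top_k(scores, k=3)  # indices of tokens (or KV pairs) to retain
print(kept)
```

In this toy setting, the retained indices would be used either to keep the matching KV pairs in the target model's cache or to keep the matching tokens in a compressed prompt; everything else is discarded.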

Cite

@article{arxiv.2506.08373,
  title  = {Draft-based Approximate Inference for LLMs},
  author = {Kevin Galim and Ethan Ewer and Wonjun Kang and Minjae Lee and Hyung Il Koo and Kangwook Lee},
  journal= {arXiv preprint arXiv:2506.08373},
  year   = {2026}
}

Comments

Accepted to ICLR 2026
