Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

Lin Zheng; Vasilisa Bashlovkina; Timothy Dozat; Dan Garrette; Laura Rimell; Joshua Maynez

Computation and Language 2026-05-12 v1 Machine Learning

Abstract

Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to information-dense regions and enabling post-hoc adjustment of inference-time compute. Across experiments on natural language and code, SP improves model quality at the same patch size; for example, even at $16$ bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a $16\times$ smaller KV cache over patches and $3$-$4\times$ less inference compute.
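
As a rough illustration of the entropy-based trigger and the KV-cache arithmetic described in the abstract, the following Python sketch may help. It is not the authors' implementation: the abstract does not specify the trigger rule, so the fixed patch size, the simple threshold test, the choice to skip the final byte of each patch, and the names `entropy_bits`, `scratchpad_positions`, `patch_kv_entries`, and `threshold_bits` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy, in bits, of one next-byte probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def scratchpad_positions(
    next_byte_dists: Sequence[Sequence[float]],
    patch_size: int,
    threshold_bits: float,
) -> List[int]:
    """Return byte positions where a scratchpad would be triggered.

    Hypothetical rule (an assumption, not taken from the paper): trigger
    whenever the next-byte entropy at a position exceeds `threshold_bits`,
    except at the last byte of a fixed-size patch, where the full patch
    is aggregated anyway.
    """
    positions = []
    for i, dist in enumerate(next_byte_dists):
        is_patch_end = (i % patch_size) == patch_size - 1
        if not is_patch_end and entropy_bits(dist) > threshold_bits:
            positions.append(i)
    return positions


def patch_kv_entries(num_bytes: int, patch_size: int) -> int:
    """Number of patch-level KV-cache entries for a byte sequence."""
    return math.ceil(num_bytes / patch_size)


# Toy distributions over a tiny alphabet (real models predict over 256 bytes).
confident = [1.0]          # 0 bits of entropy
uncertain = [0.25] * 4     # 2 bits of entropy
dists = [confident, [0.5, 0.5], uncertain, confident,
         uncertain, [0.9, 0.1], uncertain, uncertain]

triggers = scratchpad_positions(dists, patch_size=4, threshold_bits=1.5)
kv_ratio = patch_kv_entries(1024, 1) / patch_kv_entries(1024, 16)
```

Under these assumptions, raising or lowering `threshold_bits` after training changes how many scratchpads fire, which is one way to read the abstract's "post-hoc adjustment of inference-time compute". The ratio `kv_ratio` mirrors the stated $16\times$ reduction in patch-level KV-cache entries at $16$ bytes per patch.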

Cite

@article{arxiv.2605.09630,
  title  = {Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models},
  author = {Lin Zheng and Vasilisa Bashlovkina and Timothy Dozat and Dan Garrette and Laura Rimell and Joshua Maynez},
  journal= {arXiv preprint arXiv:2605.09630},
  year   = {2026}
}

Comments

23 pages, 15 figures
