Tokenization with Split Trees

Craig W. Schmidt; Michael Krumdick; Adam Wiemerslage; Seth Ebner; Varshini Reddy; Yuval Pinter; Chris Tanner

Computation and Language 2026-05-28 v2

Abstract

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vocabulary, inference recursively descends each split tree and emits the first in-vocabulary node reached on each path. Vocabulary selection is formulated as an Integer Program (IP) that minimizes the total token count over all split trees under this inference procedure. The Linear Programming (LP) relaxation is near-integral in practice, yielding provably near-optimal vocabularies, with training time empirically scaling quadratically in the number of split trees. On English text, ToaST reduces token counts by more than 11% compared to BPE, WordPiece, and UnigramLM at vocabulary sizes of 40,960 and above. This lowers the number of inference tokens for models using the tokenizer and thereby extends their effective context length. ToaST also uses common single-byte tokens less frequently than these baselines, leading to a substantial improvement in Rényi efficiency. In experiments training 1.5B-parameter language models, ToaST achieves the highest CORE score, outperforming the baselines by 2.6% to 7.6%, with statistical significance against two of the three, and scores best on 13 of 22 individual tasks.
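The recursive inference step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SplitNode` class, the `build` helper, and the hand-written split tree for the pretoken `tokens` are hypothetical, and the sketch assumes that leaves are single bytes that are always emitted when reached (a byte-fallback assumption). How ToaST chooses split points from n-gram counts is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SplitNode:
    """One node of a split tree: a byte string and its two halves (or none)."""
    text: bytes
    left: Optional["SplitNode"] = None
    right: Optional["SplitNode"] = None


def build(spec) -> SplitNode:
    """Build a tree from nested pairs; a bare bytes object is a leaf."""
    if isinstance(spec, bytes):
        return SplitNode(spec)
    left, right = build(spec[0]), build(spec[1])
    return SplitNode(left.text + right.text, left, right)


def tokenize(node: SplitNode, vocab: set) -> list:
    """Descend the tree; emit the first in-vocabulary node on each path.

    Assumption: a leaf is emitted even if it is missing from `vocab`.
    """
    if node.text in vocab or node.left is None:
        return [node.text]
    return tokenize(node.left, vocab) + tokenize(node.right, vocab)


# Hypothetical split tree for the pretoken b"tokens":
# tokens -> token + s, token -> tok + en, tok -> t + ok, and so on.
tree = build((((b"t", (b"o", b"k")), (b"e", b"n")), b"s"))

print(tokenize(tree, {b"token"}))        # [b'token', b's']
print(tokenize(tree, {b"tok", b"en"}))   # [b'tok', b'en', b's']
print(tokenize(tree, {b"oken"}))         # [b't', b'o', b'k', b'e', b'n', b's']
```

The last call shows a consequence of this procedure: a vocabulary entry such as `oken` that is not a node of the split tree can never be emitted for that pretoken, so only strings that appear as tree nodes are useful vocabulary candidates. In a full binary tree the token count also equals one plus the number of internal nodes that are expanded rather than emitted, which is one way a token-count objective could be expressed.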

Keywords

tokenization; decision tree; parsing

Cite

@article{arxiv.2605.22705,
  title  = {Tokenization with Split Trees},
  author = {Craig W. Schmidt and Michael Krumdick and Adam Wiemerslage and Seth Ebner and Varshini Reddy and Yuval Pinter and Chris Tanner},
  journal= {arXiv preprint arXiv:2605.22705},
  year   = {2026}
}

Comments

All baseline tokenizers (BPE, WordPiece, Unigram) were trained incorrectly due to a bug in the Hugging Face tokenizers library: pair counts overflow i32 above ~108 GB of training data, dropping the most common merge pairs. All comparisons to ToaST are invalid. Thanks to Sander Land for identifying the missing merge pairs. See https://github.com/huggingface/tokenizers/issues/2058
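The effect of a signed 32-bit overflow on pair counts can be sketched in Python. This is an illustration under stated assumptions, not a reproduction of the library's code: it assumes two's-complement wraparound, and the pairs and counts are hypothetical.

```python
def wrap_i32(n: int) -> int:
    """Interpret an integer as a two's-complement signed 32-bit value."""
    return (n + 2**31) % 2**32 - 2**31


# Hypothetical true pair counts; the first exceeds the i32 maximum (2**31 - 1).
true_counts = {("t", "h"): 3_000_000_000, ("q", "z"): 1_000}

# Counts as they would look after being stored in a wrapping i32 counter.
stored = {pair: wrap_i32(count) for pair, count in true_counts.items()}

print(stored[("t", "h")])           # -1294967296
print(max(stored, key=stored.get))  # ('q', 'z')
```

Under this assumption, the most frequent pair ends up with a negative stored count and loses to a rare pair when the best merge is selected, which is consistent with the most common merge pairs being dropped.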
