Batching BPE Tokenization Merges

Computation and Language 2024-08-12 v1 Artificial Intelligence

Abstract

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique, combined with reducing the memory footprint of the text used in vocabulary training, makes it feasible to train a high-quality tokenizer on a basic laptop. This paper presents BatchBPE, an open-source, pure Python implementation of these concepts, with the goal of making experimentation with new tokenization strategies more accessible, especially in compute- and memory-constrained contexts. BatchBPE's usefulness and malleability are demonstrated by training several token vocabularies to explore the batch merging process and to experiment with preprocessing a stop word list and ignoring the least common text chunks in a dataset. The resulting encoded lengths of texts are used as a basic evaluation metric.
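
The two ideas in the abstract, merging several pairs per pass and storing the training text compactly, can be illustrated with a minimal Python sketch. It is not taken from BatchBPE: the function names (`count_pairs`, `select_batch`, `apply_merges`, `train`) are hypothetical, whitespace splitting stands in for a real chunking rule, and the "safe" batching criterion used here (no two pairs in a batch share a token) is a deliberately conservative assumption rather than the paper's exact rule.

```python
from collections import Counter


def count_pairs(chunks):
    """Count adjacent token pairs, weighted by how often each chunk occurs."""
    pairs = Counter()
    for tokens, freq in chunks.items():
        for pair in zip(tokens, tokens[1:]):
            pairs[pair] += freq
    return pairs


def select_batch(pairs, max_batch):
    """Pick frequent pairs that share no tokens (assumed, conservative rule)."""
    batch, used = [], set()
    for pair, _count in pairs.most_common():
        if len(batch) == max_batch:
            break
        if used.isdisjoint(pair):
            batch.append(pair)
            used.update(pair)
    return batch


def apply_merges(tokens, new_ids):
    """Replace each batched pair with its new token id, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in new_ids:
            out.append(new_ids[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)


def train(text, vocab_size, max_batch=4):
    """Learn merges in batches over a chunk -> count dictionary."""
    # Memory saving: each distinct chunk is stored once with its count,
    # instead of keeping the full text as one long token sequence.
    chunks = Counter(tuple(word.encode("utf-8")) for word in text.split())
    merges, next_id = {}, 256  # ids 0..255 are the raw bytes
    while next_id < vocab_size:
        pairs = count_pairs(chunks)
        batch = select_batch(pairs, min(max_batch, vocab_size - next_id))
        if not batch:
            break
        new_ids = {}
        for pair in batch:
            new_ids[pair] = next_id
            next_id += 1
        merges.update(new_ids)
        # One pass over the chunks applies every merge in the batch.
        merged = Counter()
        for tokens, freq in chunks.items():
            merged[apply_merges(tokens, new_ids)] += freq
        chunks = merged
    return merges
```

Calling `train("low low low lower", vocab_size=258, max_batch=2)` learns two merges in a single pass over the chunk dictionary. A real trainer would likely also apply a minimum pair frequency so that rare pairs are not merged just because they fit in the batch; that threshold is omitted here to keep the sketch short.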

Cite

@article{arxiv.2408.04653,
  title  = {Batching BPE Tokenization Merges},
  author = {Alexander P. Morgan},
  journal= {arXiv preprint arXiv:2408.04653},
  year   = {2024}
}

Comments

8 pages, 5 figures, 1 code block