A Formal Perspective on Byte-Pair Encoding

Vilém Zouhar; Clara Meister; Juan Luis Gastaldi; Li Du; Tim Vieira; Mrinmaya Sachan; Ryan Cotterell

Computation and Language 2024-09-04 v3 Optimization and Control

Abstract

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE seeks to solve has not yet been laid down. We formalize BPE as a combinatorial optimization problem. Via submodular functions, we prove that the iterative greedy version is a $\frac{1}{\sigma(\boldsymbol{\mu}^\star)}\left(1-e^{-\sigma(\boldsymbol{\mu}^\star)}\right)$-approximation of an optimal merge sequence, where $\sigma(\boldsymbol{\mu}^\star)$ is the total backward curvature with respect to the optimal merge sequence $\boldsymbol{\mu}^\star$. Empirically, the lower bound of the approximation is $\approx 0.37$. We provide a faster implementation of BPE which improves the runtime complexity from $\mathcal{O}(N M)$ to $\mathcal{O}(N \log M)$, where $N$ is the sequence length and $M$ is the merge count. Finally, we optimize the brute-force algorithm for optimal BPE using memoization.
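To make the greedy procedure and the approximation factor concrete, here is a minimal Python sketch. It is not the authors' implementation: it is the naive variant that rescans the whole sequence for every merge (roughly the $\mathcal{O}(N M)$ behaviour the abstract refers to), not the faster $\mathcal{O}(N \log M)$ version. The function names `apply_merge`, `greedy_bpe`, and `approximation_bound` are hypothetical, and the tie-breaking rule (first pair encountered) is an assumption made for illustration.

```python
import math
from collections import Counter


def apply_merge(seq, pair):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])  # fuse the two symbols into one
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_bpe(text, num_merges):
    """Naive greedy BPE: at each step, merge the most frequent adjacent pair.

    Assumption: pair counts include overlapping occurrences (e.g. "aaa"
    counts ("a", "a") twice), and ties go to the pair seen first.
    """
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        counts = Counter(zip(seq, seq[1:]))
        if not counts:  # fewer than two symbols left, nothing to merge
            break
        pair = max(counts, key=counts.get)
        merges.append(pair)
        seq = apply_merge(seq, pair)  # full rescan: O(N) per merge
    return merges, seq


def approximation_bound(sigma):
    """Evaluate (1 / sigma) * (1 - exp(-sigma)) for a curvature value sigma > 0."""
    return (1.0 - math.exp(-sigma)) / sigma


merges, seq = greedy_bpe("aaabdaaabac", 3)
print(merges)  # [('a', 'a'), ('aa', 'a'), ('aaa', 'b')]
print(seq)     # ['aaab', 'd', 'aaab', 'a', 'c']
```

Three merges shorten the 11-symbol input to 5 symbols. For the bound, a curvature of 1 gives `approximation_bound(1.0)`, i.e. $1 - e^{-1} \approx 0.632$; larger curvature values give a weaker guarantee.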

Keywords

approximation algorithm, source coding, mixed precision training

Cite

@article{arxiv.2306.16837,
  title  = {A Formal Perspective on Byte-Pair Encoding},
  author = {Vilém Zouhar and Clara Meister and Juan Luis Gastaldi and Li Du and Tim Vieira and Mrinmaya Sachan and Ryan Cotterell},
  journal= {arXiv preprint arXiv:2306.16837},
  year   = {2024}
}

Comments

ACL 2023
