Related papers: Batching BPE Tokenization Merges

200 papers

Most neural machine translation systems are built upon subword units extracted by methods such as Byte-Pair Encoding (BPE) or wordpiece. However, the choice of number of merge operations is generally made by following existing recipes. In…

Computation and Language · Computer Science 2019-06-26 Shuoyang Ding, Adithya Renduchintala, Kevin Duh

Sequence models for binary analysis are bottlenecked by byte-level tokenization: raw bytes waste precious context window capacity for transformers and other neural network architectures, and many existing text-oriented tokenizers fail on…

Machine Learning · Computer Science 2025-11-25 Michael J. Bommarito

We introduce three simple randomized variants of byte pair encoding (BPE) and explore whether randomizing the selection of merge operations substantially affects a downstream machine translation task. We focus on translation into…

Computation and Language · Computer Science 2023-05-05 Jonne Sälevä, Constantine Lignos

Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters from noisy text or character-rich languages such as Japanese and Chinese however can…

Computation and Language · Computer Science 2019-12-09 Changhan Wang, Kyunghyun Cho, Jiatao Gu

Tokenization is a critical preprocessing step in large language model pipelines, yet widely-used implementations remain CPU-bound and suboptimal for batch inference workflows on GPU. We present BlockBPE, a parallel GPU implementation of…

Computation and Language · Computer Science 2025-07-17 Amos You

In this work, we show a fundamental limitation in vocabulary adaptation approaches that use the Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains. Current approaches trivially…

Computation and Language · Computer Science 2025-04-29 Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte Pair…

Genomics · Quantitative Biology 2025-05-15 Marina Popova, Iaroslav Chelombitko, Aleksey Komissarov

Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding…

Computation and Language · Computer Science 2025-05-05 Bharath Raj, Garvit Suri, Vikrant Dewangan, Raghav Sonavane

We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as testbed, BPEmb performs competitively, and for some languages…

Computation and Language · Computer Science 2017-10-09 Benjamin Heinzerling, Michael Strube

We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not…

Computation and Language · Computer Science 2024-04-02 Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter

Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models, symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers…

Computation and Language · Computer Science 2026-02-16 Saumitra Yadav, Manish Shrivastava

Typically, tokenization is the very first step in most text processing works. As a token serves as an atomic unit that embeds the contextual information of text, how to define a token plays a decisive role in the performance of a model. Even…

Computation and Language · Computer Science 2020-10-07 Kyubyong Park, Joohong Lee, Seongbo Jang, Dawoon Jung

Language models can largely benefit from efficient tokenization. However, they still mostly utilize the classical BPE algorithm, a simple and reliable method. This has been shown to cause such issues as under-trained tokens and sub-optimal…

Computation and Language · Computer Science 2024-09-10 Pavel Chizhov, Catherine Arnett, Elizaveta Korotkova, Ivan P. Yamshchikov

Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode…

Computation and Language · Computer Science 2024-10-22 Marco Cognetta, Naoaki Okazaki

Tokenization is an important first step in Natural Language Processing (NLP) pipelines because it decides how models learn and represent linguistic information. However, current subword tokenizers like SentencePiece or HuggingFace BPE are…

Computation and Language · Computer Science 2025-11-10 Firoj Ahmmed Patwary, Abdullah Al Noman

In this paper, we formalize practical byte pair encoding tokenization as it is used in large language models and other NLP systems, in particular we formally define and investigate the semantics of the SentencePiece and HuggingFace…

Formal Languages and Automata Theory · Computer Science 2023-09-19 Martin Berglund, Brink van der Merwe

Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte- and character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which…

Computation and Language · Computer Science 2026-03-23 Azam Nouri

Byte-based machine translation systems have shown significant potential in massively multilingual settings. Unicode encoding, which maps each character to specific byte(s), eliminates the emergence of unknown words, even in new languages.…

Computation and Language · Computer Science 2025-02-10 Langlin Huang, Mengyu Bu, Yang Feng

Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique…

Computation and Language · Computer Science 2022-10-12 Odunayo Ogundepo, Xinyu Zhang, Jimmy Lin

In this paper, we aim to do code completion based on implementing a Neural Network from Li et al. Our contribution is that we use an encoding that is in-between character and word encoding called Byte Pair Encoding (BPE). We use this on…

Computation and Language · Computer Science 2020-04-15 Youri Arkesteijn, Nikhil Saldanha, Bastijn Kostense