Related papers: Batching BPE Tokenization Merges

A Call for Prudent Choice of Subword Merge Operations in Neural Machine Translation

Most neural machine translation systems are built upon subword units extracted by methods such as Byte-Pair Encoding (BPE) or wordpiece. However, the choice of number of merge operations is generally made by following existing recipes. In…

Computation and Language · Computer Science 2019-06-26 Shuoyang Ding, Adithya Renduchintala, Kevin Duh

Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis

Sequence models for binary analysis are bottlenecked by byte-level tokenization: raw bytes waste precious context window capacity for transformers and other neural network architectures, and many existing text-oriented tokenizers fail on…

Machine Learning · Computer Science 2025-11-25 Michael J. Bommarito

What changes when you randomly choose BPE merge operations? Not much

We introduce three simple randomized variants of byte pair encoding (BPE) and explore whether randomizing the selection of merge operations substantially affects a downstream machine translation task. We focus on translation into…

Computation and Language · Computer Science 2023-05-05 Jonne Sälevä, Constantine Lignos

Neural Machine Translation with Byte-Level Subwords

Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters from noisy text or character-rich languages such as Japanese and Chinese however can…

Computation and Language · Computer Science 2019-12-09 Changhan Wang, Kyunghyun Cho, Jiatao Gu

BlockBPE: Parallel BPE Tokenization

Tokenization is a critical preprocessing step in large language model pipelines, yet widely-used implementations remain CPU-bound and suboptimal for batch inference workflows on GPU. We present BlockBPE, a parallel GPU implementation of…

Computation and Language · Computer Science 2025-07-17 Amos You

Adaptive BPE Tokenization for Enhanced Vocabulary Adaptation in Finetuning Pretrained Language Models

In this work, we show a fundamental limitation in vocabulary adaptation approaches that use Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains. Current approaches trivially…

Computation and Language · Computer Science 2025-04-29 Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes

The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte Pair…

Genomics · Quantitative Biology 2025-05-15 Marina Popova, Iaroslav Chelombitko, Aleksey Komissarov

When Every Token Counts: Optimal Segmentation for Low-Resource Language Models

Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding…

Computation and Language · Computer Science 2025-05-05 Bharath Raj, Garvit Suri, Vikrant Dewangan, Raghav Sonavane

BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages

We present BPEmb, a collection of pre-trained subword unit embeddings in 275 languages, based on Byte-Pair Encoding (BPE). In an evaluation using fine-grained entity typing as testbed, BPEmb performs competitively, and for some languages…

Computation and Language · Computer Science 2017-10-09 Benjamin Heinzerling, Michael Strube

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords. The technique is available in popular tokenization libraries but has not…

Computation and Language · Computer Science 2024-04-02 Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter

Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance

Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models, symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers…

Computation and Language · Computer Science 2026-02-16 Saumitra Yadav, Manish Shrivastava

An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks

Typically, tokenization is the very first step in most text processing works. As a token serves as an atomic unit that embeds the contextual information of text, how to define a token plays a decisive role in the performance of a model. Even…

Computation and Language · Computer Science 2020-10-07 Kyubyong Park, Joohong Lee, Seongbo Jang, Dawoon Jung

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Language models can largely benefit from efficient tokenization. However, they still mostly utilize the classical BPE algorithm, a simple and reliable method. This has been shown to cause such issues as under-trained tokens and sub-optimal…

Computation and Language · Computer Science 2024-09-10 Pavel Chizhov, Catherine Arnett, Elizaveta Korotkova, Ivan P. Yamshchikov

Tokenization as Finite-State Transduction

Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode…

Computation and Language · Computer Science 2024-10-22 Marco Cognetta, Naoaki Okazaki

Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

Tokenization is an important first step in Natural Language Processing (NLP) pipelines because it decides how models learn and represent linguistic information. However, current subword tokenizers like SentencePiece or HuggingFace BPE are…

Computation and Language · Computer Science 2025-11-10 Firoj Ahmmed Patwary, Abdullah Al Noman

Formalizing BPE Tokenization

In this paper, we formalize practical byte pair encoding tokenization as it is used in large language models and other NLP systems, in particular we formally define and investigate the semantics of the SentencePiece and HuggingFace…

Formal Languages and Automata Theory · Computer Science 2023-09-19 Martin Berglund, Brink van der Merwe

Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging

Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte- and character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which…

Computation and Language · Computer Science 2026-03-23 Azam Nouri

MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation

Byte-based machine translation systems have shown significant potential in massively multilingual settings. Unicode encoding, which maps each character to specific byte(s), eliminates the emergence of unknown words, even in new languages.…

Computation and Language · Computer Science 2025-02-10 Langlin Huang, Mengyu Bu, Yang Feng

Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers

Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique…

Computation and Language · Computer Science 2022-10-12 Odunayo Ogundepo, Xinyu Zhang, Jimmy Lin

Code Completion using Neural Attention and Byte Pair Encoding

In this paper, we aim to do code completion based on implementing a Neural Network from Li et al. Our contribution is that we use an encoding that is in-between character and word encoding called Byte Pair Encoding (BPE). We use this on…

Computation and Language · Computer Science 2020-04-15 Youri Arkesteijn, Nikhil Saldanha, Bastijn Kostense