Related papers: BlockBPE: Parallel BPE Tokenization

GPUTOK: GPU Accelerated Byte Level BPE Tokenization

As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that…

Computation and Language · Computer Science 2026-03-04 Venu Gopal Kadamba, Kanishkha Jaisankar

Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis

Sequence models for binary analysis are bottlenecked by byte-level tokenization: raw bytes waste precious context window capacity for transformers and other neural network architectures, and many existing text-oriented tokenizers fail on…

Machine Learning · Computer Science 2025-11-25 Michael J. Bommarito

Faster Superword Tokenization

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend…

Computation and Language · Computer Science 2026-04-08 Craig W. Schmidt, Chris Tanner, Yuval Pinter

Batching BPE Tokenization Merges

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique combined with reducing the memory footprint of text used in vocabulary training…

Computation and Language · Computer Science 2024-08-12 Alexander P. Morgan

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently…

Computation and Language · Computer Science 2025-08-25 Negar Foroutan, Clara Meister, Debjit Paul, Joel Niklaus, Sina Ahmadi, Antoine Bosselut, Rico Sennrich

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier

Pre-tokenization, the initial step in many modern tokenization pipelines, segments text into smaller units called pretokens, typically splitting on whitespace and punctuation. While this process encourages having full, individual words as…

Computation and Language · Computer Science 2025-10-03 Craig W. Schmidt, Varshini Reddy, Chris Tanner, Yuval Pinter

BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization,…

Computation and Language · Computer Science 2025-06-02 Sander Land, Catherine Arnett

AdaptBPE: From General Purpose to Specialized Tokenizers

Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly…

Computation and Language · Computer Science 2026-01-30 Vijini Liyanage, François Yvon

LBPE: Long-token-first Tokenization to Improve Large Language Models

The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids issues of out-of-vocabulary words. Despite its success, a critical challenge persists: long tokens, rich…

Computation and Language · Computer Science 2024-11-11 Haoran Lian, Yizhe Xiong, Zijia Lin, Jianwei Niu, Shasha Mo, Hui Chen, Peng Liu, Guiguang Ding

From Characters to Tokens: Dynamic Grouping with Hierarchical BPE

Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare…

Computation and Language · Computer Science 2025-10-20 Rares Dolga, Lucas Maystre, Tudor Berariu, David Barber

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by…

Artificial Intelligence · Computer Science 2025-03-11 Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu

Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal

Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a…

Computation and Language · Computer Science 2024-11-14 Haoran Lian, Yizhe Xiong, Jianwei Niu, Shasha Mo, Zhenpeng Su, Zijia Lin, Hui Chen, Peng Liu, Jungong Han, Guiguang Ding

A Partition Cover Approach to Tokenization

Tokenization is the process of encoding strings into tokens of a fixed vocabulary size, and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte-Pair Encoding (BPE), which…

Computation and Language · Computer Science 2025-09-30 Jia Peng Lim, Shawn Tan, Davin Choo, Hady W. Lauw

Peek2: Regex-free Byte-level Byte-Pair Encoding Pretokenizer for LLM Inference on Edge Devices

Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers, yet little work has been done to optimize it for edge-side inference. Our proposed new implementation, Peek2, serves as a drop-in replacement for cl100k-like…

Computation and Language · Computer Science 2026-05-04 Liu Zai, Iraklis Klampanos

Tokenization as Finite-State Transduction

Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode…

Computation and Language · Computer Science 2024-10-22 Marco Cognetta, Naoaki Okazaki

Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic…

Computation and Language · Computer Science 2026-01-27 Sawsan Alqahtani, Mir Tafseer Nayeem, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari

MorphTok: Morphologically Grounded Tokenization for Indian Languages

Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm…

Computation and Language · Computer Science 2025-11-10 Maharaj Brahma, N J Karthika, Atul Singh, Devaraj Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar

Theoretical Analysis of Byte-Pair Encoding

Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in grammar-based text compression. It is employed in a variety of language processing tasks such as machine translation or large language model (LLM)…

Data Structures and Algorithms · Computer Science 2024-11-14 László Kozma, Johannes Voderholzer

A Formal Perspective on Byte-Pair Encoding

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE…

Computation and Language · Computer Science 2024-09-04 Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell