Related papers: Frequency-Ordered Tokenization for Better Text Com…

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

Assessing the Importance of Frequency versus Compositionality for Subword-based Tokenization in NMT

Subword tokenization is the de facto standard for tokenization in neural language models and machine translation systems. Three advantages are frequently cited in favor of subwords: shorter encoding of frequent tokens, compositionality of…

Computation and Language · Computer Science 2024-01-15 Benoist Wolleb, Romain Silvestri, Giorgos Vernikos, Ljiljana Dolamic, Andrei Popescu-Belis

When Every Token Counts: Optimal Segmentation for Low-Resource Language Models

Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding…

Computation and Language · Computer Science 2025-05-05 Bharath Raj, Garvit Suri, Vikrant Dewangan, Raghav Sonavane

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier

Pre-tokenization, the initial step in many modern tokenization pipelines, segments text into smaller units called pretokens, typically splitting on whitespace and punctuation. While this process encourages having full, individual words as…

Computation and Language · Computer Science 2025-10-03 Craig W. Schmidt, Varshini Reddy, Chris Tanner, Yuval Pinter

Pre-trained Models Perform the Best When Token Distributions Follow Zipf's Law

Tokenization is a fundamental step in natural language processing (NLP) and other sequence modeling domains, where the choice of vocabulary size significantly impacts model performance. Despite its importance, selecting an optimal…

Machine Learning · Computer Science 2025-07-31 Yanjin He, Qingkai Zeng, Meng Jiang

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

Fast WordPiece Tokenization

Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we propose efficient algorithms for the WordPiece tokenization used in BERT, from single-word tokenization to general text (e.g., sentence)…

Computation and Language · Computer Science 2021-10-07 Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou

Compute Optimal Tokenization

Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information…

Computation and Language · Computer Science 2026-05-27 Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer

MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression

Large language models have drastically changed the prospects of AI by introducing technologies for more complex natural language processing. However, current methodologies to train such LLMs require extensive resources including but not…

Computation and Language · Computer Science 2026-04-27 Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard

Significance-Gain Pair Encoding for LLMs: A Statistical Alternative to Frequency-Based Subword Merging

Subword tokenization is a key design choice for modern language models, including large language models (LLMs), with byte- and character-level BPE serving as a widely used baseline. Standard BPE selects merges by raw pair frequency, which…

Computation and Language · Computer Science 2026-03-23 Azam Nouri

Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation

In language processing, transformers benefit greatly from text being condensed. This is achieved through a larger vocabulary that captures word fragments instead of plain characters. This is often done with Byte Pair Encoding. In the…

Computer Vision and Pattern Recognition · Computer Science 2024-11-18 Tim Elsner, Paula Usinger, Julius Nehring-Wirxel, Gregor Kobsik, Victor Czech, Yanjiang He, Isaak Lim, Leif Kobbelt

Making compression algorithms for Unicode text

The majority of online content is written in languages other than English, and is most commonly encoded in UTF-8, the world's dominant Unicode character encoding. Traditional compression algorithms typically operate on individual bytes.…

Information Theory · Computer Science 2017-01-17 Adam Gleave, Christian Steinruecken

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Tokenization significantly influences language models (LMs)' performance. This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while…

Computation and Language · Computer Science 2024-03-04 Jinbiao Yang

Byte Pair Encoding for Symbolic Music

When used with deep learning, the symbolic music modality is often coupled with language model architectures. To do so, the music needs to be tokenized, i.e. converted into a sequence of discrete tokens. This can be achieved by different…

Machine Learning · Computer Science 2023-11-14 Nathan Fradet, Nicolas Gutowski, Fabien Chhel, Jean-Pierre Briot

Byte Pair Encoding for Efficient Time Series Forecasting

Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in…

Machine Learning · Computer Science 2026-01-29 Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn

Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction

Traditional lossless text compression preserves every byte, but its gains on natural language are often modest in realistic operating regimes. We study lossy semantic text compression, where the encoder strategically deletes parts of…

Computation and Language · Computer Science 2026-05-29 Yuchun Zou, Junhong Tong, Jun Li

BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training

Language models can largely benefit from efficient tokenization. However, they still mostly utilize the classical BPE algorithm, a simple and reliable method. This has been shown to cause such issues as under-trained tokens and sub-optimal…

Computation and Language · Computer Science 2024-09-10 Pavel Chizhov, Catherine Arnett, Elizaveta Korotkova, Ivan P. Yamshchikov

Beyond Text Compression: Evaluating Tokenizers Across Scales

The choice of tokenizer can profoundly impact language model performance, yet accessible and reliable evaluations of tokenizer quality remain an open challenge. Inspired by scaling consistency, we show that smaller models can accurately…

Computation and Language · Computer Science 2025-06-04 Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili

A path to natural language through tokenisation and transformers

Natural languages exhibit striking regularities in their statistical structure, including notably the emergence of Zipf's and Heaps' laws. Despite this, it remains broadly unclear how these properties relate to the modern tokenisation…

Computation and Language · Computer Science 2026-01-08 David S. Berman, Alexander G. Stapleton

Faster Superword Tokenization

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend…

Computation and Language · Computer Science 2026-04-08 Craig W. Schmidt, Chris Tanner, Yuval Pinter