Related papers: Tokenization Is More Than Compression

When Every Token Counts: Optimal Segmentation for Low-Resource Language Models

Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding…

Computation and Language · Computer Science 2025-05-05 Bharath Raj, Garvit Suri, Vikrant Dewangan, Raghav Sonavane

Tokenization as Finite-State Transduction

Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode…

Computation and Language · Computer Science 2024-10-22 Marco Cognetta, Naoaki Okazaki

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Tokenization significantly influences language models' (LMs') performance. This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while…

Computation and Language · Computer Science 2024-03-04 Jinbiao Yang

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently…

Computation and Language · Computer Science 2025-08-25 Negar Foroutan, Clara Meister, Debjit Paul, Joel Niklaus, Sina Ahmadi, Antoine Bosselut, Rico Sennrich

An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks

Typically, tokenization is the very first step in most text processing works. As a token serves as an atomic unit that embeds the contextual information of text, how to define a token plays a decisive role in the performance of a model. Even…

Computation and Language · Computer Science 2020-10-07 Kyubyong Park, Joohong Lee, Seongbo Jang, Dawoon Jung

Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties.…

Computation and Language · Computer Science 2024-11-27 Burak Suyunu, Enes Taylan, Arzucan Özgür

Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic…

Computation and Language · Computer Science 2026-01-27 Sawsan Alqahtani, Mir Tafseer Nayeem, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari

Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance

Despite it being the cornerstone of BPE, the most common tokenization algorithm, the importance of compression in the tokenization process is still unclear. In this paper, we argue for the theoretical importance of compression, that can be…

Computation and Language · Computer Science 2024-06-25 Omer Goldman, Avi Caciularu, Matan Eyal, Kris Cao, Idan Szpektor, Reut Tsarfaty

Improving Tokenisation by Alternative Treatment of Spaces

Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based language models all use subword tokenisation algorithms to process input text. Existing algorithms have problems, often producing tokenisations…

Computation and Language · Computer Science 2022-10-25 Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

Tokenization is an important first step in Natural Language Processing (NLP) pipelines because it decides how models learn and represent linguistic information. However, current subword tokenizers like SentencePiece or HuggingFace BPE are…

Computation and Language · Computer Science 2025-11-10 Firoj Ahmmed Patwary, Abdullah Al Noman

Comparative analysis of subword tokenization approaches for Indian languages

Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words…

Computation and Language · Computer Science 2025-05-23 Sudhansu Bala Das, Samujjal Choudhury, Tapas Kumar Mishra, Bidyut Kr. Patra

MorphTok: Morphologically Grounded Tokenization for Indian Languages

Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm…

Computation and Language · Computer Science 2025-11-10 Maharaj Brahma, N J Karthika, Atul Singh, Devaraj Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar

Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier

Pre-tokenization, the initial step in many modern tokenization pipelines, segments text into smaller units called pretokens, typically splitting on whitespace and punctuation. While this process encourages having full, individual words as…

Computation and Language · Computer Science 2025-10-03 Craig W. Schmidt, Varshini Reddy, Chris Tanner, Yuval Pinter

AdaptBPE: From General Purpose to Specialized Tokenizers

Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly…

Computation and Language · Computer Science 2026-01-30 Vijini Liyanage, François Yvon

How Important Is Tokenization in French Medical Masked Language Models?

Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread utilization of pre-trained language models. This shift began with Byte-Pair…

Computation and Language · Computer Science 2024-06-11 Yanis Labrak, Adrien Bazoge, Beatrice Daille, Mickael Rouvier, Richard Dufour

From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and…

Computation and Language · Computer Science 2025-06-18 Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating…

Computation and Language · Computer Science 2021-12-21 Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, Samson Tan

Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers

Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique…

Computation and Language · Computer Science 2022-10-12 Odunayo Ogundepo, Xinyu Zhang, Jimmy Lin

Batching BPE Tokenization Merges

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique combined with reducing the memory footprint of text used in vocabulary training…

Computation and Language · Computer Science 2024-08-12 Alexander P. Morgan

Impact of Tokenization on Language Models: An Analysis for Turkish

Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be…

Computation and Language · Computer Science 2023-03-28 Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahinuç, Oguzhan Ozcelik