Related papers: Tokenization with Factorized Subword Encoding

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

GQ-VAE: A gated quantized VAE for learning variable length tokens

While most frontier models still use deterministic frequency-based tokenization algorithms such as byte-pair encoding (BPE), there has been significant recent work to design learned neural tokenizers. However, these schemes generally add to…

Machine Learning · Computer Science 2025-12-29 Theo Datta, Kayla Huang, Sham Kakade, David Brandfonbrener

Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge

The popular subword tokenizers of current language models, such as Byte-Pair Encoding (BPE), are known not to respect morpheme boundaries, which affects the downstream performance of the models. While many improved tokenization algorithms…

Computation and Language · Computer Science 2024-04-23 Khuyagbaatar Batsuren, Ekaterina Vylomova, Verna Dankers, Tsetsuukhei Delgerbaatar, Omri Uzan, Yuval Pinter, Gábor Bella

Factorized Visual Tokenization and Generation

Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant…

Computer Vision and Pattern Recognition · Computer Science 2024-11-28 Zechen Bai, Jianxiong Gao, Ziteng Gao, Pichao Wang, Zheng Zhang, Tong He, Mike Zheng Shou

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Tokenization significantly influences language models' (LMs) performance. This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while…

Computation and Language · Computer Science 2024-03-04 Jinbiao Yang

AdaptBPE: From General Purpose to Specialized Tokenizers

Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly…

Computation and Language · Computer Science 2026-01-30 Vijini Liyanage, François Yvon

Morphological Typology in BPE Subword Productivity and Language Modeling

This study investigates the impact of morphological typology on tokenization and language modeling performance. We focus on languages with synthetic and analytical morphological structures and examine their productivity when tokenized using…

Computation and Language · Computer Science 2024-11-01 Iñigo Parra

Unsupervised Morphological Tree Tokenizer

As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic…

Computation and Language · Computer Science 2025-07-11 Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu

Lexically Grounded Subword Segmentation

We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present an algebraic method for obtaining subword embeddings…

Computation and Language · Computer Science 2024-10-04 Jindřich Libovický, Jindřich Helcl

Tokenization as Finite-State Transduction

Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode…

Computation and Language · Computer Science 2024-10-22 Marco Cognetta, Naoaki Okazaki

Understanding and Mitigating Tokenization Bias in Language Models

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction.…

Computation and Language · Computer Science 2024-07-09 Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich

Comparative analysis of subword tokenization approaches for Indian languages

Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words…

Computation and Language · Computer Science 2025-05-23 Sudhansu Bala Das, Samujjal Choudhury, Tapas Kumar Mishra, Bidyut Kr. Patra

From Characters to Tokens: Dynamic Grouping with Hierarchical BPE

Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare…

Computation and Language · Computer Science 2025-10-20 Rares Dolga, Lucas Maystre, Tudor Berariu, David Barber

Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

This paper presents a comprehensive examination of the impact of tokenization strategies and vocabulary sizes on the performance of Arabic language models in downstream natural language processing tasks. Our investigation focused on the…

Computation and Language · Computer Science 2024-09-23 Mohamed Taher Alrefaie, Nour Eldin Morsy, Nada Samir

Analyzing Cognitive Plausibility of Subword Tokenization

Subword tokenization has become the de-facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm…

Computation and Language · Computer Science 2023-10-23 Lisa Beinborn, Yuval Pinter

Morphological evaluation of subwords vocabulary used by BETO language model

Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always…

Computation and Language · Computer Science 2024-10-04 Óscar García-Sierra, Ana Fernández-Pampillón Cesteros, Miguel Ortega-Martín

Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic…

Computation and Language · Computer Science 2026-01-27 Sawsan Alqahtani, Mir Tafseer Nayeem, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari

Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties.…

Computation and Language · Computer Science 2024-11-27 Burak Suyunu, Enes Taylan, Arzucan Özgür

Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation

Subword tokenization critically affects Natural Language Processing (NLP) performance, yet its behavior in morphologically rich and low-resource language families remains under-explored. This study systematically compares three subword…

Computation and Language · Computer Science 2026-03-31 Nuo Xu, Ahrii Kim