Related papers: Fast WordPiece Tokenization

Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers

Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique…

Computation and Language · Computer Science · 2022-10-12 · Odunayo Ogundepo, Xinyu Zhang, Jimmy Lin

Formalizing BPE Tokenization

In this paper, we formalize practical byte pair encoding tokenization as it is used in large language models and other NLP systems, in particular we formally define and investigate the semantics of the SentencePiece and HuggingFace…

Formal Languages and Automata Theory · Computer Science · 2023-09-19 · Martin Berglund, Brink van der Merwe

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science · 2024-10-08 · Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

MorphPiece: A Linguistic Tokenizer for Large Language Models

Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, without much consideration to the linguistic features. I propose a…

Computation and Language · Computer Science · 2024-02-06 · Haris Jabbar

Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties.…

Computation and Language · Computer Science · 2024-11-27 · Burak Suyunu, Enes Taylan, Arzucan Özgür

Semantic Tokenizer for Enhanced Natural Language Processing

Traditionally, NLP performance improvement has been focused on improving models and increasing the number of model parameters. NLP vocabulary construction has remained focused on maximizing the number of words represented through subword…

Computation and Language · Computer Science · 2023-04-26 · Sandeep Mehta, Darpan Shah, Ravindra Kulkarni, Cornelia Caragea

TreePiece: Faster Semantic Parsing via Tree Tokenization

Autoregressive (AR) encoder-decoder neural networks have proved successful in many NLP problems, including Semantic Parsing -- a task that translates natural language to machine-readable parse trees. However, the sequential prediction…

Computation and Language · Computer Science · 2023-03-31 · Sid Wang, Akshat Shrivastava, Sasha Livshits

Tokenization Matters: Improving Zero-Shot NER for Indic Languages

Tokenization is a critical component of Natural Language Processing (NLP), especially for low resource languages, where subword segmentation influences vocabulary structure and downstream task accuracy. Although Byte Pair Encoding (BPE) is…

Computation and Language · Computer Science · 2025-04-25 · Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal

Tokenization with Split Trees

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed…

Computation and Language · Computer Science · 2026-05-28 · Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner

Improving Tokenisation by Alternative Treatment of Spaces

Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based language models all use subword tokenisation algorithms to process input text. Existing algorithms have problems, often producing tokenisations…

Computation and Language · Computer Science · 2022-10-25 · Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

Impact of Tokenization on Language Models: An Analysis for Turkish

Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be…

Computation and Language · Computer Science · 2023-03-28 · Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahinuç, Oguzhan Ozcelik

MaxMatch-Dropout: Subword Regularization for WordPiece

We present a subword regularization method for WordPiece, which uses a maximum matching algorithm for tokenization. The proposed method, MaxMatch-Dropout, randomly drops words in a search using the maximum matching algorithm. It realizes…

Computation and Language · Computer Science · 2022-09-12 · Tatsuya Hiraoka

An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks

Typically, tokenization is the very first step in most text processing works. As a token serves as an atomic unit that embeds the contextual information of text, how to define a token plays a decisive role in the performance of a model. Even…

Computation and Language · Computer Science · 2020-10-07 · Kyubyong Park, Joohong Lee, Seongbo Jang, Dawoon Jung

Tokenization as Finite-State Transduction

Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode…

Computation and Language · Computer Science · 2024-10-22 · Marco Cognetta, Naoaki Okazaki

Frequency-Ordered Tokenization for Better Text Compression

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with…

Information Theory · Computer Science · 2026-02-27 · Maximilian Kalcher

AraToken: Optimizing Arabic Tokenization with Normalization Pipeline and Language Extension for Qwen3

Tokenization is a critical preprocessing step for large language models (LLMs), directly impacting training efficiency and downstream performance. General-purpose tokenizers trained predominantly on English and Latin-script languages…

Computation and Language · Computer Science · 2025-12-23 · Mark Kashirskiy, Artiom Lipinski, Ilya Makarov

Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

Tokenization is an important first step in Natural Language Processing (NLP) pipelines because it decides how models learn and represent linguistic information. However, current subword tokenizers like SentencePiece or HuggingFace BPE are…

Computation and Language · Computer Science · 2025-11-10 · Firoj Ahmmed Patwary, Abdullah Al Noman

ImagePiece: Content-aware Re-tokenization for Efficient Image Recognition

Vision Transformers (ViTs) have achieved remarkable success in various computer vision tasks. However, ViTs have a huge computational cost due to their inherent reliance on multi-head self-attention (MHSA), prompting efforts to accelerate…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-24 · Seungdong Yoa, Seungjun Lee, Hyeseung Cho, Bumsoo Kim, Woohyung Lim

Faster Superword Tokenization

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend…

Computation and Language · Computer Science · 2026-04-08 · Craig W. Schmidt, Chris Tanner, Yuval Pinter

CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models

While many languages possess processes of joining two or more words to create compound words, previous studies have been typically limited only to languages with excessively productive compound formation (e.g., German, Dutch) and there is…

Computation and Language · Computer Science · 2023-10-24 · Benjamin Minixhofer, Jonas Pfeiffer, Ivan Vulić