Related papers: Unsupervised Tokenization Learning

Self-tuning hyper-parameters for unsupervised cross-lingual tokenization

We explore the possibility of meta-learning for the language-independent unsupervised tokenization problem for English, Russian, and Chinese. We implement the meta-learning approach for automatic determination of hyper-parameters of the…

Computation and Language · Computer Science 2023-04-05 Anton Kolonin

A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models

Recent work on tokenizer-free multilingual pretrained models shows promising results in improving cross-lingual transfer and reducing engineering overhead (Clark et al., 2022; Xue et al., 2022). However, these works mainly focus on reporting…

Computation and Language · Computer Science 2022-10-14 Jimin Sun, Patrick Fernandes, Xinyi Wang, Graham Neubig

Beyond Text Compression: Evaluating Tokenizers Across Scales

The choice of tokenizer can profoundly impact language model performance, yet accessible and reliable evaluations of tokenizer quality remain an open challenge. Inspired by scaling consistency, we show that smaller models can accurately…

Computation and Language · Computer Science 2025-06-04 Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili

Understanding and Mitigating Tokenization Bias in Language Models

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction.…

Computation and Language · Computer Science 2024-07-09 Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich

A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning

Subword tokenization is a commonly used input pre-processing step in most recent NLP models. However, it limits the models' ability to leverage end-to-end task learning. Its frequency-based vocabulary creation compromises tokenization in…

Computation and Language · Computer Science 2022-04-25 Md Mofijul Islam, Gustavo Aguilar, Pragaash Ponnusamy, Clint Solomon Mathialagan, Chengyuan Ma, Chenlei Guo

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while…

Computation and Language · Computer Science 2024-03-04 Jinbiao Yang

Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples…

Computation and Language · Computer Science 2025-05-26 Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou

Beyond Literal Token Overlap: Token Alignability for Multilinguality

Previous work has considered token overlap, or even similarity of token distributions, as predictors for multilinguality and cross-lingual knowledge transfer in language models. However, these very literal metrics assign large distances to…

Computation and Language · Computer Science 2025-02-11 Katharina Hämmerl, Tomasz Limisiewicz, Jindřich Libovický, Alexander Fraser

FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to…

Computation and Language · Computer Science 2026-05-14 Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar

You should evaluate your language model on marginal likelihood over tokenisations

Neural language models typically tokenise input text into sub-word units to achieve an open vocabulary. The standard approach is to use a single canonical tokenisation at both train and test time. We suggest that this approach is…

Computation and Language · Computer Science 2021-09-22 Kris Cao, Laura Rimell

An In-Vitro Study on Cross-Lingual Generalization in Language Models

Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and tokenization are entangled. We introduce an in-vitro framework with two procedurally generated…

Computation and Language · Computer Science 2026-05-27 Adrian Cosma

Unsupervised Cross-Lingual Transfer of Structured Predictors without Source Data

Providing technologies to communities or domains where training data is scarce or protected, e.g., for privacy reasons, is becoming increasingly important. To that end, we generalise methods for unsupervised transfer from multiple input…

Computation and Language · Computer Science 2021-10-11 Kemal Kurniawan, Lea Frermann, Philip Schulz, Trevor Cohn

Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the…

Computation and Language · Computer Science 2026-02-04 Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith

T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and…

Computation and Language · Computer Science 2025-01-08 Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach

Explaining and Mitigating Crosslingual Tokenizer Inequities

The number of tokens it takes to encode parallel text in different languages is known to vary. These disparities are called token premiums. Having high token premiums leads to less throughput during training and increases costs at…

Computation and Language · Computer Science 2025-10-28 Catherine Arnett, Tyler A. Chang, Stella Biderman, Benjamin K. Bergen

To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer

Choosing an appropriate tokenization scheme is often a bottleneck in low-resource cross-lingual transfer. To understand the downstream implications of text representation choices, we perform a comparative analysis on language models having…

Computation and Language · Computer Science 2023-10-13 Md Mushfiqur Rahman, Fardin Ahsan Sakib, Fahim Faisal, Antonios Anastasopoulos

False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models

Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? Prior work offers mixed…

Computation and Language · Computer Science 2025-09-26 Julie Kallini, Dan Jurafsky, Christopher Potts, Martijn Bartelds

On Systematic Style Differences between Unsupervised and Supervised MT and an Application for High-Resource Machine Translation

Modern unsupervised machine translation (MT) systems reach reasonable translation quality under clean and controlled data conditions. As the performance gap between supervised and unsupervised MT narrows, it is interesting to ask whether…

Computation and Language · Computer Science 2022-04-15 Kelly Marchisio, Markus Freitag, David Grangier

Overcoming Vocabulary Constraints with Pixel-level Fallback

Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models…

Computation and Language · Computer Science 2025-08-12 Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva

Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but it is unclear where…

Computation and Language · Computer Science 2025-12-01 Woojin Chung, Jeonghoon Kim