Related papers: Unsupervised Morphological Tree Tokenizer

Lexically Grounded Subword Segmentation

We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present an algebraic method for obtaining subword embeddings…

Computation and Language · Computer Science 2024-10-04 Jindřich Libovický, Jindřich Helcl

Biomedical Language Models are Robust to Sub-optimal Tokenization

As opposed to general English, many concepts in biomedical terminology have been designed in recent history by biomedical professionals with the goal of being precise and concise. This is often achieved by concatenating meaningful…

Computation and Language · Computer Science 2023-07-11 Bernal Jiménez Gutiérrez, Huan Sun, Yu Su

Impact of Tokenization on Language Models: An Analysis for Turkish

Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be…

Computation and Language · Computer Science 2023-03-28 Cagri Toraman, Eyup Halit Yilmaz, Furkan Şahinuç, Oguzhan Ozcelik

From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time

The best performing transformer-based language models use subword tokenization techniques, such as Byte-Pair-Encoding (BPE). However, these approaches often overlook linguistic principles, such as morphological segmentation, which we…

Computation and Language · Computer Science 2025-04-03 Mikkel Wildner Kildeberg, Emil Allerslev Schledermann, Nicolaj Larsen, Rob van der Goot

MorphPiece: A Linguistic Tokenizer for Large Language Models

Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, without much consideration to the linguistic features. I propose a…

Computation and Language · Computer Science 2024-02-06 Haris Jabbar

TAMS: Translation-Assisted Morphological Segmentation

Canonical morphological segmentation is the process of analyzing words into the standard (aka underlying) forms of their constituent morphemes. This is a core task in language documentation, and NLP systems have the potential to…

Computation and Language · Computer Science 2024-10-16 Enora Rice, Ali Marashian, Luke Gessler, Alexis Palmer, Katharina von der Wense

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

Probabilistic Modelling of Morphologically Rich Languages

This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often…

Computation and Language · Computer Science 2015-08-19 Jan A. Botha

From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding

Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens. This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes.…

Computation and Language · Computer Science 2023-05-31 Li Sun, Florian Luisier, Kayhan Batmanghelich, Dinei Florencio, Cha Zhang

How Important Is Tokenization in French Medical Masked Language Models?

Subword tokenization has become the prevailing standard in the field of natural language processing (NLP) over recent years, primarily due to the widespread utilization of pre-trained language models. This shift began with Byte-Pair…

Computation and Language · Computer Science 2024-06-11 Yanis Labrak, Adrien Bazoge, Beatrice Daille, Mickael Rouvier, Richard Dufour

A Joint Model for Word Embedding and Word Morphology

This paper presents a joint model for performing unsupervised morphological analysis on words, and learning a character-level composition function from morphemes to word embeddings. Our model splits individual words into segments, and…

Computation and Language · Computer Science 2016-06-09 Kris Cao, Marek Rei

An Information-Theoretic Perspective on LLM Tokenizers

Large language model (LLM) tokenizers act as structured compressors: by mapping text to discrete token sequences, they determine token count (and thus compute and context usage) and the statistical structure seen by downstream models.…

Information Theory · Computer Science 2026-01-15 Mete Erdogan, Abhiram Gorle, Shubham Chandak, Mert Pilanci, Tsachy Weissman

The Foundations of Tokenization: Statistical and Computational Concerns

Tokenization - the practice of converting strings of characters from an alphabet into sequences of tokens over a vocabulary - is a critical step in the NLP pipeline. The use of token representations is widely credited with increased model…

Computation and Language · Computer Science 2025-04-04 Juan Luis Gastaldi, John Terilla, Luca Malagutti, Brian DuSell, Tim Vieira, Ryan Cotterell

Training Language Models with homotokens Leads to Delayed Overfitting

Subword tokenization introduces a computational layer in language models where many distinct token sequences decode to the same surface form and preserve meaning, yet induce different internal computations. Despite this non-uniqueness,…

Computation and Language · Computer Science 2026-01-14 Adrian Cosma, Stefan Ruseti, Emilian Radoi, Mihai Dascalu

MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies

Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme…

Computation and Language · Computer Science 2025-02-04 Ehsaneddin Asgari, Yassine El Kheir, Mohammad Ali Sadraei Javaheri

Contextual Morphogenesis in Large Language Models: A Novel Approach to Self-Organizing Token Representations

Token representations influence the efficiency and adaptability of language models, yet conventional tokenization strategies impose rigid segmentation boundaries that do not adjust dynamically to evolving contextual relationships. The…

Computation and Language · Computer Science 2025-08-11 Alistair Dombrowski, Beatrix Engelhardt, Dimitri Fairbrother, Henry Evidail

Tokenization Falling Short: On Subword Robustness in Large Language Models

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of…

Computation and Language · Computer Science 2024-10-07 Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li

Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models

Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. While learned subword tokenizers have become the de-facto standard, they present challenges such as large…

Computation and Language · Computer Science 2025-01-22 Pit Neitemeier, Björn Deiseroth, Constantin Eichenberg, Lukas Balles

From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and…

Computation and Language · Computer Science 2025-06-18 Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz

Byte BPE Tokenization as an Inverse string Homomorphism

Tokenization is an important preprocessing step in the training and inference of large language models (LLMs). While there has been extensive research on the expressive power of the neural architectures used in LLMs, the impact of…

Computation and Language · Computer Science 2024-12-05 Saibo Geng, Sankalp Gambhir, Chris Wendler, Robert West