Related papers: Retrofitting Large Language Models with Dynamic To…

Efficient Transformers with Dynamic Token Pooling

Transformers achieve unrivalled performance in modelling language, but remain inefficient in terms of memory and time complexity. A possible remedy is to reduce the sequence length in the intermediate layers by pooling fixed-length segments…

Computation and Language · Computer Science 2023-10-25 Piotr Nawrot, Jan Chorowski, Adrian Łańcucki, Edoardo M. Ponti

On the Effect of (Near) Duplicate Subwords in Language Modelling

Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. While typically lossless, however, this process may lead to…

Computation and Language · Computer Science 2024-07-18 Anton Schäfer, Thomas Hofmann, Imanol Schlag, Tiago Pimentel

Lexical Manifold Reconfiguration in Large Language Models: A Novel Architectural Approach for Contextual Modulation

Contextual adaptation in token embeddings plays a central role in determining how well language models maintain coherence and retain semantic relationships over extended text sequences. Static embeddings often impose constraints on lexical…

Computation and Language · Computer Science 2025-03-27 Koinis Vassilis, Godfrey Milbourne, Harriet Featherstone, Xanthe Peverell, Yorick Bletchley, Zachary Montford

Zero-Shot Tokenizer Transfer

Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and…

Computation and Language · Computer Science 2025-10-29 Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić

Vocabulary Customization for Efficient Domain-Specific LLM Deployment

When using an LLM to process text outside the training domain(s), an often overlooked factor is vocabulary mismatch, where the general-domain tokenizer fails to capture frequent domain-specific terms, leading to higher token fertility and…

Computation and Language · Computer Science 2025-10-01 Christian Herold, Michael Kozielski, Nicholas Santavas, Yannick Versley, Shahram Khadivi

Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the…

Computation and Language · Computer Science 2026-02-04 Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith

The Art of Breaking Words: Rethinking Multilingual Tokenizer Design

While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high…

Computation and Language · Computer Science 2025-08-12 Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi, Piyush Sawarkar, Viraj Thakur, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan

Accelerating Multilingual Language Model for Excessively Tokenized Languages

Recent advancements in large language models (LLMs) have remarkably enhanced performances on a variety of tasks in multiple languages. However, tokenizers in LLMs trained primarily on English-centric corpora often overly fragment a text…

Computation and Language · Computer Science 2024-08-07 Jimin Hong, Gibbeum Lee, Jaewoong Cho

ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model

Tokenizer is an essential component for large language models (LLMs), and a tokenizer with a high compression rate can improve the model's representation and processing efficiency. However, the tokenizer cannot ensure high compression rate…

Computation and Language · Computer Science 2024-10-08 Shuhao Gu, Mengdi Zhao, Bowen Zhang, Liangdong Wang, Jijie Li, Guang Liu

Tokenization Falling Short: On Subword Robustness in Large Language Models

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of…

Computation and Language · Computer Science 2024-10-07 Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li

MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level…

Computation and Language · Computer Science 2025-04-03 Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, Róbert Csordás

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models

Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. While learned subword tokenizers have become the de-facto standard, they present challenges such as large…

Computation and Language · Computer Science 2025-01-22 Pit Neitemeier, Björn Deiseroth, Constantin Eichenberg, Lukas Balles

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Tokenization significantly influences language models (LMs)' performance. This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while…

Computation and Language · Computer Science 2024-03-04 Jinbiao Yang

MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling

Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this…

Computation and Language · Computer Science 2022-12-15 Nathan Godey, Roman Castagné, Éric de la Clergerie, Benoît Sagot

Bridging the Language Gap: Dynamic Learning Strategies for Improving Multilingual Performance in LLMs

Large language models (LLMs) have revolutionized various domains but still struggle with non-Latin scripts and low-resource languages. This paper addresses the critical challenge of improving multilingual performance without extensive…

Computation and Language · Computer Science 2025-01-08 Somnath Kumar, Vaibhav Balloli, Mercy Ranjit, Kabir Ahuja, Sunayana Sitaram, Kalika Bali, Tanuja Ganu, Akshay Nambi

Contextual Morphogenesis in Large Language Models: A Novel Approach to Self-Organizing Token Representations

Token representations influence the efficiency and adaptability of language models, yet conventional tokenization strategies impose rigid segmentation boundaries that do not adjust dynamically to evolving contextual relationships. The…

Computation and Language · Computer Science 2025-08-11 Alistair Dombrowski, Beatrix Engelhardt, Dimitri Fairbrother, Henry Evidail

FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to…

Computation and Language · Computer Science 2026-05-14 Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar

The Fair Language Model Paradox

Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. Evaluation typically relies on aggregated training loss, measured at the batch level, which…

Computation and Language · Computer Science 2024-10-17 Andrea Pinto, Tomer Galanti, Randall Balestriero

TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning--especially…

Computation and Language · Computer Science 2025-06-17 Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, Cheng-Lin Liu