Related papers: Retrofitting Large Language Models with Dynamic To…

200 papers

Transformers achieve unrivalled performance in modelling language, but remain inefficient in terms of memory and time complexity. A possible remedy is to reduce the sequence length in the intermediate layers by pooling fixed-length segments…

Computation and Language · Computer Science 2023-10-25 Piotr Nawrot, Jan Chorowski, Adrian Łańcucki, Edoardo M. Ponti

Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. While typically lossless, this process may lead to…

Computation and Language · Computer Science 2024-07-18 Anton Schäfer, Thomas Hofmann, Imanol Schlag, Tiago Pimentel

Contextual adaptation in token embeddings plays a central role in determining how well language models maintain coherence and retain semantic relationships over extended text sequences. Static embeddings often impose constraints on lexical…

Computation and Language · Computer Science 2025-03-27 Koinis Vassilis, Godfrey Milbourne, Harriet Featherstone, Xanthe Peverell, Yorick Bletchley, Zachary Montford

Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and…

Computation and Language · Computer Science 2025-10-29 Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić

When using an LLM to process text outside the training domain(s), an often overlooked factor is vocabulary mismatch, where the general-domain tokenizer fails to capture frequent domain-specific terms, leading to higher token fertility and…

Computation and Language · Computer Science 2025-10-01 Christian Herold, Michael Kozielski, Nicholas Santavas, Yannick Versley, Shahram Khadivi

Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the…

Computation and Language · Computer Science 2026-02-04 Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith

While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high…

Recent advancements in large language models (LLMs) have remarkably enhanced performances on a variety of tasks in multiple languages. However, tokenizers in LLMs trained primarily on English-centric corpora often overly fragment a text…

Computation and Language · Computer Science 2024-08-07 Jimin Hong, Gibbeum Lee, Jaewoong Cho

Tokenizer is an essential component for large language models (LLMs), and a tokenizer with a high compression rate can improve the model's representation and processing efficiency. However, the tokenizer cannot ensure high compression rate…

Computation and Language · Computer Science 2024-10-08 Shuhao Gu, Mengdi Zhao, Bowen Zhang, Liangdong Wang, Jijie Li, Guang Liu

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of…

Computation and Language · Computer Science 2024-10-07 Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li

Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level…

Computation and Language · Computer Science 2025-04-03 Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, Róbert Csordás

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. While learned subword tokenizers have become the de-facto standard, they present challenges such as large…

Computation and Language · Computer Science 2025-01-22 Pit Neitemeier, Björn Deiseroth, Constantin Eichenberg, Lukas Balles

Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while…

Computation and Language · Computer Science 2024-03-04 Jinbiao Yang

Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this…

Computation and Language · Computer Science 2022-12-15 Nathan Godey, Roman Castagné, Éric de la Clergerie, Benoît Sagot

Large language models (LLMs) have revolutionized various domains but still struggle with non-Latin scripts and low-resource languages. This paper addresses the critical challenge of improving multilingual performance without extensive…

Computation and Language · Computer Science 2025-01-08 Somnath Kumar, Vaibhav Balloli, Mercy Ranjit, Kabir Ahuja, Sunayana Sitaram, Kalika Bali, Tanuja Ganu, Akshay Nambi

Token representations influence the efficiency and adaptability of language models, yet conventional tokenization strategies impose rigid segmentation boundaries that do not adjust dynamically to evolving contextual relationships. The…

Computation and Language · Computer Science 2025-08-11 Alistair Dombrowski, Beatrix Engelhardt, Dimitri Fairbrother, Henry Evidail

Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to…

Computation and Language · Computer Science 2026-05-14 Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar

Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. Evaluation typically relies on aggregated training loss, measured at the batch level, which…

Computation and Language · Computer Science 2024-10-17 Andrea Pinto, Tomer Galanti, Randall Balestriero

Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning--especially…

Computation and Language · Computer Science 2025-06-17 Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, Cheng-Lin Liu