Related papers: Multi-word Tokenization for Sequence Compression

MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression

Large language models have drastically changed the prospects of AI by introducing technologies for more complex natural language processing. However, current methodologies to train such LLMs require extensive resources including but not…

Computation and Language · Computer Science 2026-04-27 Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard

Visual-Word Tokenizer: Beyond Fixed Sets of Tokens in Vision Transformers

The cost of deploying vision transformers increasingly represents a barrier to wider industrial adoption. Existing compression techniques require additional end-to-end fine-tuning or incur a significant drawback to energy efficiency, making…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Leonidas Gee, Wing Yan Li, Viktoriia Sharmanska, Novi Quadrianto

Retrofitting Large Language Models with Dynamic Tokenization

Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the…

Computation and Language · Computer Science 2025-06-12 Darius Feher, Ivan Vulić, Benjamin Minixhofer

A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning

Subword tokenization is a commonly used input pre-processing step in most recent NLP models. However, it limits the models' ability to leverage end-to-end task learning. Its frequency-based vocabulary creation compromises tokenization in…

Computation and Language · Computer Science 2022-04-25 Md Mofijul Islam, Gustavo Aguilar, Pragaash Ponnusamy, Clint Solomon Mathialagan, Chengyuan Ma, Chenlei Guo

Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples…

Computation and Language · Computer Science 2025-05-26 Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou

Multidimensional Byte Pair Encoding: Shortened Sequences for Improved Visual Data Generation

In language processing, transformers benefit greatly from text being condensed. This is achieved through a larger vocabulary that captures word fragments instead of plain characters. This is often done with Byte Pair Encoding. In the…

Computer Vision and Pattern Recognition · Computer Science 2024-11-18 Tim Elsner, Paula Usinger, Julius Nehring-Wirxel, Gregor Kobsik, Victor Czech, Yanjiang He, Isaak Lim, Leif Kobbelt

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while…

Computation and Language · Computer Science 2024-03-04 Jinbiao Yang

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we decouple the effects of subword…

Computation and Language · Computer Science 2026-05-15 Théo Gigant, Bowen Peng, Jeffrey Quesnelle

Improving Self Consistency in LLMs through Probabilistic Tokenization

Prior research has demonstrated noticeable performance gains through the use of probabilistic tokenizations, an approach that involves employing multiple tokenizations of the same input string during the training phase of a language model.…

Computation and Language · Computer Science 2024-07-08 Ashutosh Sathe, Divyanshu Aggarwal, Sunayana Sitaram

Accelerating Multilingual Language Model for Excessively Tokenized Languages

Recent advancements in large language models (LLMs) have remarkably enhanced performances on a variety of tasks in multiple languages. However, tokenizers in LLMs trained primarily on English-centric corpora often overly fragment a text…

Computation and Language · Computer Science 2024-08-07 Jimin Hong, Gibbeum Lee, Jaewoong Cho

A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models

Recent work on tokenizer-free multilingual pretrained models shows promising results in improving cross-lingual transfer and reducing engineering overhead (Clark et al., 2022; Xue et al., 2022). However, these works mainly focus on reporting…

Computation and Language · Computer Science 2022-10-14 Jimin Sun, Patrick Fernandes, Xinyi Wang, Graham Neubig

Continuous Speech Tokenizer in Text To Speech

The fusion of speech and language in the era of large language models has garnered significant attention. Discrete speech tokens are often used in text-to-speech tasks for speech compression and portability, which is convenient for joint…

Sound · Computer Science 2025-04-01 Yixing Li, Ruobing Xie, Xingwu Sun, Yu Cheng, Zhanhui Kang

ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model

The tokenizer is an essential component of large language models (LLMs), and a tokenizer with a high compression rate can improve the model's representation and processing efficiency. However, the tokenizer cannot ensure a high compression rate…

Computation and Language · Computer Science 2024-10-08 Shuhao Gu, Mengdi Zhao, Bowen Zhang, Liangdong Wang, Jijie Li, Guang Liu

The Art of Breaking Words: Rethinking Multilingual Tokenizer Design

While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high…

Computation and Language · Computer Science 2025-08-12 Aamod Thakur, Ajay Nagpal, Atharva Savarkar, Kundeshwar Pundalik, Siddhesh Dosi, Piyush Sawarkar, Viraj Thakur, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan

SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling

Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to…

Computation and Language · Computer Science 2025-08-22 Dong Liu, Yanxuan Yu

Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but it is unclear where…

Computation and Language · Computer Science 2025-12-01 Woojin Chung, Jeonghoon Kim

MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling

Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this…

Computation and Language · Computer Science 2022-12-15 Nathan Godey, Roman Castagné, Éric de la Clergerie, Benoît Sagot

Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond

We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive…

Computation and Language · Computer Science 2023-11-14 Siyang Liu, Naihao Deng, Sahand Sabour, Yilin Jia, Minlie Huang, Rada Mihalcea

SCOPE: A Generative Approach for LLM Prompt Compression

Prompt compression methods enhance the efficiency of Large Language Models (LLMs) and minimize the cost by reducing the length of input context. The goal of prompt compression is to shorten the LLM prompt while maintaining a high generation…

Computation and Language · Computer Science 2025-08-25 Tinghui Zhang, Yifan Wang, Daisy Zhe Wang