Related papers: ByteFlow: Language Modeling through Adaptive Byte …

Word-Level Representation From Bytes For Language Modeling

Modern language models mostly take sub-words as input, a design that balances the trade-off between vocabulary size, number of parameters, and performance. However, sub-word tokenization still has disadvantages like not being robust to…

Computation and Language · Computer Science 2022-11-24 Chu-Tak Lee, Qipeng Guo, Xipeng Qiu

From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and…

Computation and Language · Computer Science 2025-06-18 Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz

ByteSpan: Information-Driven Subword Tokenisation

Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an…

Computation and Language · Computer Science 2025-06-24 Zébulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Tokenization significantly influences language models (LMs)' performance. This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while…

Computation and Language · Computer Science 2024-03-04 Jinbiao Yang

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

Proxy Compression for Language Modeling

Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces…

Computation and Language · Computer Science 2026-05-15 Lin Zheng, Xinyu Li, Qian Liu, Xiachong Feng, Lingpeng Kong

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models

Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. While learned subword tokenizers have become the de-facto standard, they present challenges such as large…

Computation and Language · Computer Science 2025-01-22 Pit Neitemeier, Björn Deiseroth, Constantin Eichenberg, Lukas Balles

Bytes Are All You Need: Transformers Operating Directly On File Bytes

Modern deep learning approaches usually utilize modality-specific processing. For example, the most common deep learning approach to image classification involves decoding image file bytes into an RGB tensor which is passed into a neural…

Computer Vision and Pattern Recognition · Computer Science 2024-07-02 Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari

Dynamic Chunking for End-to-End Hierarchical Sequence Modeling

Major progress on language models (LMs) in recent years has largely resulted from moving away from specialized models designed for specific tasks, to general models based on powerful architectures (e.g. the Transformer) that learn…

Machine Learning · Computer Science 2025-07-16 Sukjun Hwang, Brandon Wang, Albert Gu

T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and…

Computation and Language · Computer Science 2025-01-08 Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach

Local Byte Fusion for Neural Machine Translation

Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid and tokenizers built on one corpus do not adapt well to other parallel corpora. It has also been observed that in…

Computation and Language · Computer Science 2023-06-29 Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, Junjie Hu

Bolmo: Byteifying the Next Generation of Language Models

Recent advances in generative AI have been largely driven by large language models (LLMs), deep neural networks that operate over discrete units called tokens. To represent text, the vast majority of LLMs use words or word fragments as the…

Computation and Language · Computer Science 2026-02-10 Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, Valentin Hofmann

ByT5: Towards a token-free future with pre-trained byte-to-byte models

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can…

Computation and Language · Computer Science 2022-03-09 Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents…

Computation and Language · Computer Science 2025-05-16 Shaurya Sharthak, Vinayak Pahalwan, Adithya Kamath, Adarsh Shirawalmath

An Information-Theoretic Perspective on LLM Tokenizers

Large language model (LLM) tokenizers act as structured compressors: by mapping text to discrete token sequences, they determine token count (and thus compute and context usage) and the statistical structure seen by downstream models.…

Information Theory · Computer Science 2026-01-15 Mete Erdogan, Abhiram Gorle, Shubham Chandak, Mert Pilanci, Tsachy Weissman

FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to…

Computation and Language · Computer Science 2026-05-14 Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar

AdaptBPE: From General Purpose to Specialized Tokenizers

Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly…

Computation and Language · Computer Science 2026-01-30 Vijini Liyanage, François Yvon

Understanding and Mitigating Tokenization Bias in Language Models

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction.…

Computation and Language · Computer Science 2024-07-09 Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich

Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing…

Computation and Language · Computer Science 2025-04-15 Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich