Related papers: Scratchpad Patching: Decoupling Compute from Patch…

SpaceByte: Towards Deleting Tokenization from Large Language Modeling

Tokenization is widely used in large language models because it significantly improves performance. However, tokenization imposes several disadvantages, such as performance biases, increased adversarial vulnerability, decreased…

Computation and Language · Computer Science 2024-10-08 Kevin Slagle

Byte Latent Transformer: Patches Scale Better Than Tokens

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT…

Computation and Language · Computer Science 2024-12-16 Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer

TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting

Transformer-based time series foundation models face a fundamental trade-off in choice of tokenization: point-wise embeddings preserve temporal fidelity but scale poorly with sequence length, whereas fixed-length patching improves…

Artificial Intelligence · Computer Science 2026-03-13 Sravan Kumar Ankireddy, Nikita Seleznev, Nam H. Nguyen, Yulun Wu, Senthil Kumar, Furong Huang, C. Bayan Bruss

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

From Characters to Tokens: Dynamic Grouping with Hierarchical BPE

Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare…

Computation and Language · Computer Science 2025-10-20 Rares Dolga, Lucas Maystre, Tudor Berariu, David Barber

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently…

Computation and Language · Computer Science 2025-08-25 Negar Foroutan, Clara Meister, Debjit Paul, Joel Niklaus, Sina Ahmadi, Antoine Bosselut, Rico Sennrich

Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More

Since the introduction of Vision Transformer (ViT), patchification has long been regarded as a de facto image tokenization approach for plain visual architectures. By compressing the spatial size of images, this approach can effectively…

Computer Vision and Pattern Recognition · Computer Science 2026-02-23 Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie

From Quarter to All: Accelerating Speculative LLM Decoding via Floating-Point Exponent Remapping and Parameter Sharing

Large language models achieve impressive performance across diverse tasks but exhibit high inference latency due to their large parameter sizes. While quantization reduces model size, it often leads to performance degradation compared to…

Hardware Architecture · Computer Science 2025-10-22 Yushu Zhao, Yubin Qin, Yang Wang, Xiaolong Yang, Huiming Han, Shaojun Wei, Yang Hu, Shouyi Yin

Learning by Distilling Context

Language models significantly benefit from context tokens, such as prompts or scratchpads. They perform better when prompted with informative instructions, and they acquire new reasoning capabilities by generating a scratch-pad before…

Computation and Language · Computer Science 2022-10-03 Charlie Snell, Dan Klein, Ruiqi Zhong

From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and…

Computation and Language · Computer Science 2025-06-18 Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz

ByteSpan: Information-Driven Subword Tokenisation

Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an…

Computation and Language · Computer Science 2025-06-24 Zébulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery

Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models

Large language models route every input through a learned embedding table of shape |V| x d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic…

Computation and Language · Computer Science 2026-05-29 Rohan Shravan

Word-Level Representation From Bytes For Language Modeling

Modern language models mostly take sub-words as input, a design that balances the trade-off between vocabulary size, number of parameters, and performance. However, sub-word tokenization still has disadvantages like not being robust to…

Computation and Language · Computer Science 2022-11-24 Chu-Tak Lee, Qipeng Guo, Xipeng Qiu

Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding

Speculative decoding accelerates inference for Large Language Models by using a lightweight draft model to propose candidate tokens that are verified in parallel by a larger target model. Prior work shows that the draft model often…

Computation and Language · Computer Science 2026-03-06 Ofir Ben Shoham

Compute Optimal Tokenization

Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information…

Computation and Language · Computer Science 2026-05-27 Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer

Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic…

Computation and Language · Computer Science 2026-01-27 Sawsan Alqahtani, Mir Tafseer Nayeem, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing…

Computation and Language · Computer Science 2025-04-15 Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich

Length-MAX Tokenizer for Language Models

We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we…

Computation and Language · Computer Science 2025-11-27 Dong Dong, Weijie Su

Beyond Text Compression: Evaluating Tokenizers Across Scales

The choice of tokenizer can profoundly impact language model performance, yet accessible and reliable evaluations of tokenizer quality remain an open challenge. Inspired by scaling consistency, we show that smaller models can accurately…

Computation and Language · Computer Science 2025-06-04 Jonas F. Lotz, António V. Lopes, Stephan Peitz, Hendra Setiawan, Leonardo Emili