Related papers: Multiscale sequence modeling with a learned dictio…

LBPE: Long-token-first Tokenization to Improve Large Language Models

The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids issues of out-of-vocabulary words. Despite its success, a critical challenge persists: long tokens, rich…

Computation and Language · Computer Science 2024-11-11 Haoran Lian, Yizhe Xiong, Zijia Lin, Jianwei Niu, Shasha Mo, Hui Chen, Peng Liu, Guiguang Ding

Understanding and Mitigating Tokenization Bias in Language Models

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction.…

Computation and Language · Computer Science 2024-07-09 Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich

AdaptBPE: From General Purpose to Specialized Tokenizers

Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly…

Computation and Language · Computer Science 2026-01-30 Vijini Liyanage, François Yvon

Self-Vocabularizing Training for Neural Machine Translation

Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models…

Computation and Language · Computer Science 2025-04-02 Pin-Jie Lin, Ernie Chang, Yangyang Shi, Vikas Chandra

From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and…

Computation and Language · Computer Science 2025-06-18 Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz

SuperBPE: Space Travel for Language Models

The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the…

Computation and Language · Computer Science 2025-08-28 Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating…

Computation and Language · Computer Science 2021-12-21 Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, Samson Tan

Modeling Target-Side Inflection in Neural Machine Translation

NMT systems have problems with large vocabulary sizes. Byte-pair encoding (BPE) is a popular approach to solving this problem, but while BPE allows the system to generate any target-side word, it does not enable effective generalization…

Computation and Language · Computer Science 2017-09-06 Aleš Tamchyna, Marion Weller-Di Marco, Alexander Fraser

How BPE Affects Memorization in Transformers

Training data memorization in NLP can both be beneficial (e.g., closed-book QA) and undesirable (personal data extraction). In any case, successful model training requires a non-trivial amount of memorization to store word spellings,…

Computation and Language · Computer Science 2021-12-03 Eugene Kharitonov, Marco Baroni, Dieuwke Hupkes

Multi-Sense Language Modelling

The effectiveness of a language model is influenced by its token representations, which must encode contextual information and handle the same word form having a plurality of meanings (polysemy). Currently, none of the common language…

Computation and Language · Computer Science 2022-06-02 Andrea Lekkas, Peter Schneider-Kamp, Isabelle Augenstein

Multilingual Language Processing From Bytes

We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label] where start positions, lengths, and labels are separate entries in our vocabulary.…

Computation and Language · Computer Science 2016-04-05 Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya

Tokenization as Finite-State Transduction

Tokenization is the first step in modern neural language model pipelines where an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework which can efficiently encode…

Computation and Language · Computer Science 2024-10-22 Marco Cognetta, Naoaki Okazaki

Arithmetic with Language Models: from Memorization to Computation

A better understanding of the emergent computation and problem-solving capabilities of recent large language models is of paramount importance to further improve them and broaden their applicability. This work investigates how a language…

Artificial Intelligence · Computer Science 2024-08-05 Davide Maltoni, Matteo Ferrara

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while…

Computation and Language · Computer Science 2024-03-04 Jinbiao Yang

Unsupervised Cross-lingual Word Embedding by Multilingual Neural Language Models

We propose an unsupervised method to obtain cross-lingual embeddings without any parallel data or pre-trained word embeddings. The proposed model, which we call multilingual neural language models, takes sentences of multiple languages as…

Computation and Language · Computer Science 2018-09-10 Takashi Wada, Tomoharu Iwata

Distribution-Aware Companding Quantization of Large Language Models

Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More…

Computation and Language · Computer Science 2026-03-03 Athul Radhakrishnan, Siddhant Mohan, Mahima Sachdeva

Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality

Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization process of language models as it provides multiple benefits. However, this process is solely based on pre-training data statistics, making it hard for the…

Computation and Language · Computer Science 2021-09-27 Gustavo Aguilar, Bryan McCann, Tong Niu, Nazneen Rajani, Nitish Keskar, Thamar Solorio

Byte Pair Encoding for Symbolic Music

When used with deep learning, the symbolic music modality is often coupled with language model architectures. To do so, the music needs to be tokenized, i.e. converted into a sequence of discrete tokens. This can be achieved by different…

Machine Learning · Computer Science 2023-11-14 Nathan Fradet, Nicolas Gutowski, Fabien Chhel, Jean-Pierre Briot

Acoustic BPE for Speech Generation with Discrete Tokens

Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of…

Sound · Computer Science 2024-01-17 Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

Neural Machine Translation with Byte-Level Subwords

Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters from noisy text or character-rich languages such as Japanese and Chinese however can…

Computation and Language · Computer Science 2019-12-09 Changhan Wang, Kyunghyun Cho, Jiatao Gu