Related papers: Batching BPE Tokenization Merges

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference,…

Computation and Language · Computer Science · 2024-12-03 · Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith

Byte Pair Encoding is Suboptimal for Language Model Pretraining

The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair…

Computation and Language · Computer Science · 2020-10-06 · Kaj Bostrom, Greg Durrett

BBPE16: UTF-16-based byte-level byte-pair encoding for improved multilingual speech recognition

Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its…

Computation and Language · Computer Science · 2026-02-03 · Hyunsik Kim, Haeri Kim, Munhak Lee, Kyungmin Lee

Stop Taking Tokenizers for Granted: They Are Core Design Decisions in Large Language Models

Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic…

Computation and Language · Computer Science · 2026-01-27 · Sawsan Alqahtani, Mir Tafseer Nayeem, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari

Local Byte Fusion for Neural Machine Translation

Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid and tokenizers built on one corpus do not adapt well to other parallel corpora. It has also been observed that in…

Computation and Language · Computer Science · 2023-06-29 · Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, Junjie Hu

BPE-Dropout: Simple and Effective Subword Regularization

Subword segmentation is widely used to address the open vocabulary problem in machine translation. The dominant approach to subword segmentation is Byte Pair Encoding (BPE), which keeps the most frequent words intact while splitting the…

Computation and Language · Computer Science · 2020-05-05 · Ivan Provilkov, Dmitrii Emelianenko, Elena Voita

Acoustic BPE for Speech Generation with Discrete Tokens

Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of…

Sound · Computer Science · 2024-01-17 · Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

Understanding and Mitigating Tokenization Bias in Language Models

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction.…

Computation and Language · Computer Science · 2024-07-09 · Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich

Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality

Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization process of language models as it provides multiple benefits. However, this process is solely based on pre-training data statistics, making it hard for the…

Computation and Language · Computer Science · 2021-09-27 · Gustavo Aguilar, Bryan McCann, Tong Niu, Nazneen Rajani, Nitish Keskar, Thamar Solorio

From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution

Efficiency and safety of Large Language Models (LLMs), among other factors, rely on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides extra defense against jailbreak…

Computation and Language · Computer Science · 2026-04-16 · Pavel Chizhov, Egor Bogomolov, Ivan P. Yamshchikov

How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?

Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with the words not occurring during training (a.k.a. out-of-vocabulary (OOV) words) has long been a fundamental challenge for NMT systems. The predominant…

Computation and Language · Computer Science · 2022-08-18 · Ali Araabi, Christof Monz, Vlad Niculae

Linguistic Laws Meet Protein Sequences: A Comparative Analysis of Subword Tokenization Methods

Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties.…

Computation and Language · Computer Science · 2024-11-27 · Burak Suyunu, Enes Taylan, Arzucan Özgür

Learning variable length units for SMT between related languages via Byte Pair Encoding

We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare it with orthographic syllables, which are currently the best…

Computation and Language · Computer Science · 2017-07-24 · Anoop Kunchukuttan, Pushpak Bhattacharyya

BatchPrompt: Accomplish more with less

As the ever-increasing token limits of large language models (LLMs) have enabled long context as input, prompting with single data samples might no longer be an efficient way. A straightforward strategy improving efficiency is to batch data…

Computation and Language · Computer Science · 2024-07-16 · Jianzhe Lin, Maurice Diesendruck, Liang Du, Robin Abraham

Byte Pair Encoding for Efficient Time Series Forecasting

Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in…

Machine Learning · Computer Science · 2026-01-29 · Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn

MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies

Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme…

Computation and Language · Computer Science · 2025-02-04 · Ehsaneddin Asgari, Yassine El Kheir, Mohammad Ali Sadraei Javaheri

Self-Vocabularizing Training for Neural Machine Translation

Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models…

Computation and Language · Computer Science · 2025-04-02 · Pin-Jie Lin, Ernie Chang, Yangyang Shi, Vikas Chandra

Modeling Target-Side Inflection in Neural Machine Translation

NMT systems have problems with large vocabulary sizes. Byte-pair encoding (BPE) is a popular approach to solving this problem, but while BPE allows the system to generate any target-side word, it does not enable effective generalization…

Computation and Language · Computer Science · 2017-09-06 · Aleš Tamchyna, Marion Weller-Di Marco, Alexander Fraser

A cost minimization approach to fix the vocabulary size in a tokenizer for an End-to-End ASR system

Unlike hybrid speech recognition systems where the use of tokens was restricted to phones, biphones or triphones the choice of tokens in the end-to-end ASR systems is derived from the text corpus of the training data. The use of…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-06-06 · Sunil Kumar Kopparapu, Ashish Panda

SuperBPE: Space Travel for Language Models

The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the…

Computation and Language · Computer Science · 2025-08-28 · Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi