Related papers: Batching BPE Tokenization Merges

200 papers

The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference,…

Computation and Language · Computer Science 2024-12-03 Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith

The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair…

Computation and Language · Computer Science 2020-10-06 Kaj Bostrom, Greg Durrett

Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its…

Computation and Language · Computer Science 2026-02-03 Hyunsik Kim, Haeri Kim, Munhak Lee, Kyungmin Lee

Tokenization underlies every large language model, yet it remains an under-theorized and inconsistently designed component. Common subword approaches such as Byte Pair Encoding (BPE) offer scalability but often misalign with linguistic…

Computation and Language · Computer Science 2026-01-27 Sawsan Alqahtani, Mir Tafseer Nayeem, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari

Subword tokenization schemes are the dominant technique used in current NLP models. However, such schemes can be rigid and tokenizers built on one corpus do not adapt well to other parallel corpora. It has also been observed that in…

Computation and Language · Computer Science 2023-06-29 Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, Junjie Hu

Subword segmentation is widely used to address the open vocabulary problem in machine translation. The dominant approach to subword segmentation is Byte Pair Encoding (BPE), which keeps the most frequent words intact while splitting the…

Computation and Language · Computer Science 2020-05-05 Ivan Provilkov, Dmitrii Emelianenko, Elena Voita

Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of…

Sound · Computer Science 2024-01-17 Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu

State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction.…

Computation and Language · Computer Science 2024-07-09 Buu Phan, Marton Havasi, Matthew Muckley, Karen Ullrich

Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization process of language models as it provides multiple benefits. However, this process is solely based on pre-training data statistics, making it hard for the…

Computation and Language · Computer Science 2021-09-27 Gustavo Aguilar, Bryan McCann, Tong Niu, Nazneen Rajani, Nitish Keskar, Thamar Solorio

Efficiency and safety of Large Language Models (LLMs), among other factors, rely on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides extra defense against jailbreak…

Computation and Language · Computer Science 2026-04-16 Pavel Chizhov, Egor Bogomolov, Ivan P. Yamshchikov

Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with words that do not occur during training (a.k.a. out-of-vocabulary (OOV) words) has long been a fundamental challenge for NMT systems. The predominant…

Computation and Language · Computer Science 2022-08-18 Ali Araabi, Christof Monz, Vlad Niculae

Tokenization is a crucial step in processing protein sequences for machine learning models, as proteins are complex sequences of amino acids that require meaningful segmentation to capture their functional and structural properties.…

Computation and Language · Computer Science 2024-11-27 Burak Suyunu, Enes Taylan, Arzucan Özgür

We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare it with orthographic syllables, which are currently the best…

Computation and Language · Computer Science 2017-07-24 Anoop Kunchukuttan, Pushpak Bhattacharyya

As the ever-increasing token limits of large language models (LLMs) have enabled long context as input, prompting with single data samples might no longer be an efficient approach. A straightforward strategy for improving efficiency is to batch data…

Computation and Language · Computer Science 2024-07-16 Jianzhe Lin, Maurice Diesendruck, Liang Du, Robin Abraham

Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in…

Machine Learning · Computer Science 2026-01-29 Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn

Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme…

Computation and Language · Computer Science 2025-02-04 Ehsaneddin Asgari, Yassine El Kheir, Mohammad Ali Sadraei Javaheri

Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models…

Computation and Language · Computer Science 2025-04-02 Pin-Jie Lin, Ernie Chang, Yangyang Shi, Vikas Chandra

NMT systems have problems with large vocabulary sizes. Byte-pair encoding (BPE) is a popular approach to solving this problem, but while BPE allows the system to generate any target-side word, it does not enable effective generalization…

Computation and Language · Computer Science 2017-09-06 Aleš Tamchyna, Marion Weller-Di Marco, Alexander Fraser

Unlike hybrid speech recognition systems, where the use of tokens was restricted to phones, biphones, or triphones, the choice of tokens in end-to-end ASR systems is derived from the text corpus of the training data. The use of…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-06 Sunil Kumar Kopparapu, Ashish Panda

The assumption across nearly all language model (LM) tokenization schemes is that tokens should be subwords, i.e., contained within word boundaries. While providing a seemingly reasonable inductive bias, is this common practice limiting the…

Computation and Language · Computer Science 2025-08-28 Alisa Liu, Jonathan Hayase, Valentin Hofmann, Sewoong Oh, Noah A. Smith, Yejin Choi