Related papers: Batching BPE Tokenization Merges

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Tokenization significantly influences language models (LMs)' performance. This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while…

Computation and Language · Computer Science 2024-03-04 Jinbiao Yang

On the Effectiveness of Acoustic BPE in Decoder-Only TTS

Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair…

Sound · Computer Science 2024-10-30 Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu

Efficient Cold-Start Recommendation via BPE Token-Level Embedding Initialization with LLM

The cold-start issue is the challenge when we talk about recommender systems, especially in the case when we do not have the past interaction data of new users or new items. Content-based features or hybrid solutions are common as…

Information Retrieval · Computer Science 2025-09-17 Yushang Zhao, Xinyue Han, Qian Leng, Qianyi Sun, Haotian Lyu, Chengrui Zhou

ByteSpan: Information-Driven Subword Tokenisation

Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an…

Computation and Language · Computer Science 2025-06-24 Zébulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery

MorphTok: Morphologically Grounded Tokenization for Indian Languages

Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-pair Encoding (BPE) algorithm…

Computation and Language · Computer Science 2025-11-10 Maharaj Brahma, N J Karthika, Atul Singh, Devaraj Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar

Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods

This paper presents a novel hybrid tokenization strategy that enhances the performance of DNA Language Models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization is effective at capturing…

Computation and Language · Computer Science 2025-07-25 Ganesh Sapkota, Md Hasibur Rahman

Optimized Tokenization for Transcribed Error Correction

The challenges facing speech recognition systems, such as variations in pronunciations, adverse audio conditions, and the scarcity of labeled data, emphasize the necessity for a post-processing step that corrects recurring errors. Previous…

Computation and Language · Computer Science 2023-10-18 Tomer Wullach, Shlomo E. Chazan

From Token to Token Pair: Efficient Prompt Compression for Large Language Models in Clinical Prediction

By processing electronic health records (EHRs) as natural language sequences, large language models (LLMs) have shown potential in clinical prediction tasks such as mortality prediction and phenotyping. However, longitudinal or highly…

Computation and Language · Computer Science 2026-05-13 Mingcheng Zhu, Zhiyao Luo, Yu Liu, Tingting Zhu

Multiscale sequence modeling with a learned dictionary

We propose a generalization of neural network sequence models. Instead of predicting one symbol at a time, our multi-scale model makes predictions over multiple, potentially overlapping multi-symbol tokens. A variation of the byte-pair…

Machine Learning · Statistics 2017-07-06 Bart van Merriënboer, Amartya Sanyal, Hugo Larochelle, Yoshua Bengio

Teaching Old Tokenizers New Words: Efficient Tokenizer Adaptation for Pre-trained Models

Tokenizer adaptation plays an important role in adapting pre-trained language models to new domains or languages. In this work, we address two complementary aspects of this process: vocabulary extension and pruning. The common approach to…

Computation and Language · Computer Science 2026-03-24 Taido Purason, Pavel Chizhov, Ivan P. Yamshchikov, Mark Fishel

Constructing a BPE Tokenization DFA

Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem. In this paper, we give and analyze an algorithm for the efficient construction of deterministic finite automata (DFA)…

Formal Languages and Automata Theory · Computer Science 2025-05-27 Martin Berglund, Willeke Martens, Brink van der Merwe

MalwarePT: A Binary-Level Foundation Model for Malware Analysis

Automated malware analysis increasingly relies on machine learning, yet most existing methods remain task-specific and depend on handcrafted features or narrowly scoped models. Recent developments in binary-level foundation models suggest a…

Cryptography and Security · Computer Science 2026-05-19 Saastha Vasan, Yuzhou Nie, Kaie Chen, Yigitcan Kaya, Hojjat Aghakhani, Roman Vasilenko, Wenbo Guo, Christopher Kruegel, Giovanni Vigna

Sampling from Your Language Model One Byte at a Time

Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's…

Computation and Language · Computer Science 2026-05-08 Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh

Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning

Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data, where the step-by-step thought process is explicitly outlined by text tokens. However, this results in lengthy inputs where many words…

Computation and Language · Computer Science 2025-09-03 DiJia Su, Hanlin Zhu, Yingchen Xu, Jiantao Jiao, Yuandong Tian, Qinqing Zheng

Morphological Typology in BPE Subword Productivity and Language Modeling

This study investigates the impact of morphological typology on tokenization and language modeling performance. We focus on languages with synthetic and analytical morphological structures and examine their productivity when tokenized using…

Computation and Language · Computer Science 2024-11-01 Iñigo Parra

MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages and scripts. While character- or byte-level…

Computation and Language · Computer Science 2025-04-03 Julie Kallini, Shikhar Murty, Christopher D. Manning, Christopher Potts, Róbert Csordás

Token-level Ensembling of Models with Different Vocabularies

Model ensembling is a technique to combine the predicted distributions of two or more models, often leading to improved robustness and performance. For ensembling in text generation, the next token's probability distribution is derived from…

Computation and Language · Computer Science 2025-03-03 Rachel Wicks, Kartik Ravisankar, Xinchen Yang, Philipp Koehn, Matt Post

Improving Tokenisation by Alternative Treatment of Spaces

Tokenisation is the first step in almost all NLP tasks, and state-of-the-art transformer-based language models all use subword tokenisation algorithms to process input text. Existing algorithms have problems, often producing tokenisations…

Computation and Language · Computer Science 2022-10-25 Edward Gow-Smith, Harish Tayyar Madabushi, Carolina Scarton, Aline Villavicencio

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating…

Computation and Language · Computer Science 2021-12-21 Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, Samson Tan

Free Lunch for Efficient Textual Commonsense Integration in Language Models

Recent years have witnessed the emergence of textual commonsense knowledge bases, aimed at providing more nuanced and context-rich knowledge. The integration of external commonsense into language models has been shown to be a key enabler in…

Computation and Language · Computer Science 2023-05-26 Wanyun Cui, Xingran Chen