Related papers: Code Completion using Neural Attention and Byte Pa…

Theoretical Analysis of Byte-Pair Encoding

Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in grammar-based text compression. It is employed in a variety of language processing tasks such as machine translation or large language model (LLM)…

Data Structures and Algorithms · Computer Science 2024-11-14 László Kozma , Johannes Voderholzer

Code Completion with Neural Attention and Pointer Networks

Intelligent code completion has become an essential research task to accelerate modern software development. To facilitate effective code completion for dynamically-typed programming languages, we apply neural language models by learning…

Computation and Language · Computer Science 2019-09-12 Jian Li , Yue Wang , Michael R. Lyu , Irwin King

A Formal Perspective on Byte-Pair Encoding

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE…

Computation and Language · Computer Science 2024-09-04 Vilém Zouhar , Clara Meister , Juan Luis Gastaldi , Li Du , Tim Vieira , Mrinmaya Sachan , Ryan Cotterell

Entropy-Driven Pre-Tokenization for Byte-Pair Encoding

Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such…

Computation and Language · Computer Science 2025-06-23 Yifan Hu , Frank Liang , Dachuan Zhao , Jonathan Geuter , Varshini Reddy , Craig W. Schmidt , Chris Tanner

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently…

Computation and Language · Computer Science 2025-08-25 Negar Foroutan , Clara Meister , Debjit Paul , Joel Niklaus , Sina Ahmadi , Antoine Bosselut , Rico Sennrich

Batching BPE Tokenization Merges

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique combined with reducing the memory footprint of text used in vocabulary training…

Computation and Language · Computer Science 2024-08-12 Alexander P. Morgan

Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation

Byte-Pair Encoding (BPE) is an algorithm commonly used in Natural Language Processing to build a vocabulary of subwords, which has been recently applied to symbolic music. Given that symbolic music can differ significantly from text,…

Information Retrieval · Computer Science 2024-10-03 Dinh-Viet-Toan Le , Louis Bigo , Mikaela Keller

How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?

Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with the words not occurring during training (a.k.a. out-of-vocabulary (OOV) words) have long been a fundamental challenge for NMT systems. The predominant…

Computation and Language · Computer Science 2022-08-18 Ali Araabi , Christof Monz , Vlad Niculae

Faster Superword Tokenization

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend…

Computation and Language · Computer Science 2026-04-08 Craig W. Schmidt , Chris Tanner , Yuval Pinter

LBPE: Long-token-first Tokenization to Improve Large Language Models

The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids issues of out-of-vocabulary words. Despite its success, a critical challenge persists: long tokens, rich…

Computation and Language · Computer Science 2024-11-11 Haoran Lian , Yizhe Xiong , Zijia Lin , Jianwei Niu , Shasha Mo , Hui Chen , Peng Liu , Guiguang Ding

Towards Full-line Code Completion with Neural Language Models

A code completion system suggests future code elements to developers given a partially-complete code snippet. Code completion is one of the most useful features in Integrated Development Environments (IDEs). Currently, most code completion…

Software Engineering · Computer Science 2020-09-21 Wenhan Wang , Sijie Shen , Ge Li , Zhi Jin

Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition

Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge in various natural language and speech processing tasks. Recent research highlights the dependency of BPE subword…

Computation and Language · Computer Science 2024-01-30 Ahnaf Mozib Samin

Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance

Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models, symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers…

Computation and Language · Computer Science 2026-02-16 Saumitra Yadav , Manish Shrivastava

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by…

Artificial Intelligence · Computer Science 2025-03-11 Wanpeng Zhang , Zilong Xie , Yicheng Feng , Yijiang Li , Xingrun Xing , Sipeng Zheng , Zongqing Lu

Byte Pair Encoding is Suboptimal for Language Model Pretraining

The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair…

Computation and Language · Computer Science 2020-10-06 Kaj Bostrom , Greg Durrett

Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models

Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about…

Computation and Language · Computer Science 2025-08-12 Tomohiro Sawada , Kartik Goyal

BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization,…

Computation and Language · Computer Science 2025-06-02 Sander Land , Catherine Arnett

Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal

Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a…

Computation and Language · Computer Science 2024-11-14 Haoran Lian , Yizhe Xiong , Jianwei Niu , Shasha Mo , Zhenpeng Su , Zijia Lin , Hui Chen , Peng Liu , Jungong Han , Guiguang Ding

Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?

The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference,…

Computation and Language · Computer Science 2024-12-03 Jonathan Hayase , Alisa Liu , Yejin Choi , Sewoong Oh , Noah A. Smith

Modeling Target-Side Inflection in Neural Machine Translation

NMT systems have problems with large vocabulary sizes. Byte-pair encoding (BPE) is a popular approach to solving this problem, but while BPE allows the system to generate any target-side word, it does not enable effective generalization…

Computation and Language · Computer Science 2017-09-06 Aleš Tamchyna , Marion Weller-Di Marco , Alexander Fraser