Related papers: Byte-Pair Encoding for Text-to-SQL Generation

Theoretical Analysis of Byte-Pair Encoding

Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in grammar-based text compression. It is employed in a variety of language processing tasks such as machine translation or large language model (LLM)…

Data Structures and Algorithms · Computer Science 2024-11-14 László Kozma, Johannes Voderholzer

Batching BPE Tokenization Merges

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique combined with reducing the memory footprint of text used in vocabulary training…

Computation and Language · Computer Science 2024-08-12 Alexander P. Morgan

Entropy-Driven Pre-Tokenization for Byte-Pair Encoding

Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such…

Computation and Language · Computer Science 2025-06-23 Yifan Hu, Frank Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig W. Schmidt, Chris Tanner

Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation

Byte-Pair Encoding (BPE) is an algorithm commonly used in Natural Language Processing to build a vocabulary of subwords, which has been recently applied to symbolic music. Given that symbolic music can differ significantly from text,…

Information Retrieval · Computer Science 2024-10-03 Dinh-Viet-Toan Le, Louis Bigo, Mikaela Keller

Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition

Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge in various natural language and speech processing tasks. Recent research highlights the dependency of BPE subword…

Computation and Language · Computer Science 2024-01-30 Ahnaf Mozib Samin

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently…

Computation and Language · Computer Science 2025-08-25 Negar Foroutan, Clara Meister, Debjit Paul, Joel Niklaus, Sina Ahmadi, Antoine Bosselut, Rico Sennrich

Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal

Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a…

Computation and Language · Computer Science 2024-11-14 Haoran Lian, Yizhe Xiong, Jianwei Niu, Shasha Mo, Zhenpeng Su, Zijia Lin, Hui Chen, Peng Liu, Jungong Han, Guiguang Ding

GraphBPE: Molecular Graphs Meet Byte-Pair Encoding

With the increasing attention to molecular machine learning, various innovations have been made in designing better models or proposing more comprehensive benchmarks. However, less is studied on the data preprocessing schedule for molecular…

Machine Learning · Computer Science 2024-07-30 Yuchen Shen, Barnabás Póczos

Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance

Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models, symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers…

Computation and Language · Computer Science 2026-02-16 Saumitra Yadav, Manish Shrivastava

Code Completion using Neural Attention and Byte Pair Encoding

In this paper, we aim to do code completion based on implementing a Neural Network from Li et al. Our contribution is that we use an encoding that is in-between character and word encoding called Byte Pair Encoding (BPE). We use this on…

Computation and Language · Computer Science 2020-04-15 Youri Arkesteijn, Nikhil Saldanha, Bastijn Kostense

Byte Pair Encoding for Symbolic Music

When used with deep learning, the symbolic music modality is often coupled with language model architectures. To do so, the music needs to be tokenized, i.e. converted into a sequence of discrete tokens. This can be achieved by different…

Machine Learning · Computer Science 2023-11-14 Nathan Fradet, Nicolas Gutowski, Fabien Chhel, Jean-Pierre Briot

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by…

Artificial Intelligence · Computer Science 2025-03-11 Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu

Faster Superword Tokenization

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend…

Computation and Language · Computer Science 2026-04-08 Craig W. Schmidt, Chris Tanner, Yuval Pinter

Train It and Forget It: Merge Lists are Unnecessary for BPE Inference in Language Models

Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about…

Computation and Language · Computer Science 2025-08-12 Tomohiro Sawada, Kartik Goyal

LBPE: Long-token-first Tokenization to Improve Large Language Models

The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids issues of out-of-vocabulary words. Despite its success, a critical challenge persists: long tokens, rich…

Computation and Language · Computer Science 2024-11-11 Haoran Lian, Yizhe Xiong, Zijia Lin, Jianwei Niu, Shasha Mo, Hui Chen, Peng Liu, Guiguang Ding

Modeling Target-Side Inflection in Neural Machine Translation

NMT systems have problems with large vocabulary sizes. Byte-pair encoding (BPE) is a popular approach to solving this problem, but while BPE allows the system to generate any target-side word, it does not enable effective generalization…

Computation and Language · Computer Science 2017-09-06 Aleš Tamchyna, Marion Weller-Di Marco, Alexander Fraser

Learning variable length units for SMT between related languages via Byte Pair Encoding

We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare it with orthographic syllables, which are currently the best…

Computation and Language · Computer Science 2017-07-24 Anoop Kunchukuttan, Pushpak Bhattacharyya

EzSQL: An SQL intermediate representation for improving SQL-to-text Generation

The SQL-to-text generation task traditionally uses template-based, Seq2Seq, tree-to-sequence, and graph-to-sequence models. Recent models take advantage of pre-trained generative language models for this task in the Seq2Seq framework.…

Computation and Language · Computer Science 2025-04-10 Meher Bhardwaj, Hrishikesh Ethari, Dennis Singh Moirangthem

From Bytes to Ideas: Language Modeling with Autoregressive U-Nets

Tokenization imposes a fixed granularity on the input text, freezing how a language model operates on data and how far in the future it predicts. Byte Pair Encoding (BPE) and similar schemes split text once, build a static vocabulary, and…

Computation and Language · Computer Science 2025-06-18 Mathurin Videau, Badr Youbi Idrissi, Alessandro Leite, Marc Schoenauer, Olivier Teytaud, David Lopez-Paz

Acoustic BPE for Speech Generation with Discrete Tokens

Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of…

Sound · Computer Science 2024-01-17 Feiyu Shen, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu