Related papers: BBPE16: UTF-16-based byte-level byte-pair encoding…

Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition

Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge in various natural language and speech processing tasks. Recent research highlights the dependency of BPE subword…

Computation and Language · Computer Science 2024-01-30 Ahnaf Mozib Samin

Bilingual End-to-End ASR with Byte-Level Subwords

In this paper, we investigate how the output representation of an end-to-end neural network affects multilingual automatic speech recognition (ASR). We study different representations including character-level, byte-level, byte pair…

Computation and Language · Computer Science 2022-05-03 Liuhui Deng , Roger Hsiao , Arnab Ghoshal

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently…

Computation and Language · Computer Science 2025-08-25 Negar Foroutan , Clara Meister , Debjit Paul , Joel Niklaus , Sina Ahmadi , Antoine Bosselut , Rico Sennrich

Neural Machine Translation with Byte-Level Subwords

Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters from noisy text or character-rich languages such as Japanese and Chinese however can…

Computation and Language · Computer Science 2019-12-09 Changhan Wang , Kyunghyun Cho , Jiatao Gu

Optimizing Byte-level Representation for End-to-end ASR

We propose a novel approach to optimizing a byte-level representation for end-to-end automatic speech recognition (ASR). Byte-level representation is often used by large scale multilingual ASR systems when the character set of the supported…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-06 Roger Hsiao , Liuhui Deng , Erik McDermott , Ruchir Travadi , Xiaodan Zhuang

Entropy-Driven Pre-Tokenization for Byte-Pair Encoding

Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such…

Computation and Language · Computer Science 2025-06-23 Yifan Hu , Frank Liang , Dachuan Zhao , Jonathan Geuter , Varshini Reddy , Craig W. Schmidt , Chris Tanner

BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization

Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization,…

Computation and Language · Computer Science 2025-06-02 Sander Land , Catherine Arnett

Faster Superword Tokenization

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend…

Computation and Language · Computer Science 2026-04-08 Craig W. Schmidt , Chris Tanner , Yuval Pinter

Acoustic BPE for Speech Generation with Discrete Tokens

Discrete audio tokens derived from self-supervised learning models have gained widespread usage in speech generation. However, current practice of directly utilizing audio tokens poses challenges for sequence modeling due to the length of…

Sound · Computer Science 2024-01-17 Feiyu Shen , Yiwei Guo , Chenpeng Du , Xie Chen , Kai Yu

Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis

Sequence models for binary analysis are bottlenecked by byte-level tokenization: raw bytes waste precious context window capacity for transformers and other neural network architectures, and many existing text-oriented tokenizers fail on…

Machine Learning · Computer Science 2025-11-25 Michael J. Bommarito

On the Effectiveness of Acoustic BPE in Decoder-Only TTS

Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair…

Sound · Computer Science 2024-10-30 Bohan Li , Feiyu Shen , Yiwei Guo , Shuai Wang , Xie Chen , Kai Yu

Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance

Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models, symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers…

Computation and Language · Computer Science 2026-02-16 Saumitra Yadav , Manish Shrivastava

LBPE: Long-token-first Tokenization to Improve Large Language Models

The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids issues of out-of-vocabulary words. Despite its success, a critical challenge persists: long tokens, rich…

Computation and Language · Computer Science 2024-11-11 Haoran Lian , Yizhe Xiong , Zijia Lin , Jianwei Niu , Shasha Mo , Hui Chen , Peng Liu , Guiguang Ding

Batching BPE Tokenization Merges

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique combined with reducing the memory footprint of text used in vocabulary training…

Computation and Language · Computer Science 2024-08-12 Alexander P. Morgan

An investigation of phone-based subword units for end-to-end speech recognition

Phones and their context-dependent variants have been the standard modeling units for conventional speech recognition systems, while characters and subwords have demonstrated their effectiveness for end-to-end recognition systems. We…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-23 Weiran Wang , Guangsen Wang , Aadyot Bhatnagar , Yingbo Zhou , Caiming Xiong , Richard Socher

Analyzing Byte-Pair Encoding on Monophonic and Polyphonic Symbolic Music: A Focus on Musical Phrase Segmentation

Byte-Pair Encoding (BPE) is an algorithm commonly used in Natural Language Processing to build a vocabulary of subwords, which has been recently applied to symbolic music. Given that symbolic music can differ significantly from text,…

Information Retrieval · Computer Science 2024-10-03 Dinh-Viet-Toan Le , Louis Bigo , Mikaela Keller

Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal

Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a…

Computation and Language · Computer Science 2024-11-14 Haoran Lian , Yizhe Xiong , Jianwei Niu , Shasha Mo , Zhenpeng Su , Zijia Lin , Hui Chen , Peng Liu , Jungong Han , Guiguang Ding

AdaptBPE: From General Purpose to Specialized Tokenizers

Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly…

Computation and Language · Computer Science 2026-01-30 Vijini Liyanage , François Yvon

Theoretical Analysis of Byte-Pair Encoding

Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in grammar-based text compression. It is employed in a variety of language processing tasks such as machine translation or large language model (LLM)…

Data Structures and Algorithms · Computer Science 2024-11-14 László Kozma , Johannes Voderholzer

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt , Varshini Reddy , Haoran Zhang , Alec Alameddine , Omri Uzan , Yuval Pinter , Chris Tanner