Related papers: Code Completion using Neural Attention and Byte Pa…

200 papers

Byte-Pair Encoding (BPE) is a widely used method for subword tokenization, with origins in grammar-based text compression. It is employed in a variety of language processing tasks such as machine translation or large language model (LLM)…

Data Structures and Algorithms · Computer Science 2024-11-14 László Kozma, Johannes Voderholzer

Intelligent code completion has become an essential research task to accelerate modern software development. To facilitate effective code completion for dynamically-typed programming languages, we apply neural language models by learning…

Computation and Language · Computer Science 2019-09-12 Jian Li, Yue Wang, Michael R. Lyu, Irwin King

Byte-Pair Encoding (BPE) is a popular algorithm used for tokenizing data in NLP, despite being devised initially as a compression method. BPE appears to be a greedy algorithm at face value, but the underlying optimization problem that BPE…

Computation and Language · Computer Science 2024-09-04 Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell

Byte-Pair Encoding (BPE) has become a widely adopted subword tokenization method in modern language models due to its simplicity and strong empirical performance across downstream tasks. However, applying BPE to unsegmented languages such…

Computation and Language · Computer Science 2025-06-23 Yifan Hu, Frank Liang, Dachuan Zhao, Jonathan Geuter, Varshini Reddy, Craig W. Schmidt, Chris Tanner

Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently…

Computation and Language · Computer Science 2025-08-25 Negar Foroutan, Clara Meister, Debjit Paul, Joel Niklaus, Sina Ahmadi, Antoine Bosselut, Rico Sennrich

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique combined with reducing the memory footprint of text used in vocabulary training…

Computation and Language · Computer Science 2024-08-12 Alexander P. Morgan

Byte-Pair Encoding (BPE) is an algorithm commonly used in Natural Language Processing to build a vocabulary of subwords, which has been recently applied to symbolic music. Given that symbolic music can differ significantly from text,…

Information Retrieval · Computer Science 2024-10-03 Dinh-Viet-Toan Le, Louis Bigo, Mikaela Keller

Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with the words not occurring during training (a.k.a. out-of-vocabulary (OOV) words) has long been a fundamental challenge for NMT systems. The predominant…

Computation and Language · Computer Science 2022-08-18 Ali Araabi, Christof Monz, Vlad Niculae

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend…

Computation and Language · Computer Science 2026-04-08 Craig W. Schmidt, Chris Tanner, Yuval Pinter

The prevalent use of Byte Pair Encoding (BPE) in Large Language Models (LLMs) facilitates robust handling of subword units and avoids issues of out-of-vocabulary words. Despite its success, a critical challenge persists: long tokens, rich…

Computation and Language · Computer Science 2024-11-11 Haoran Lian, Yizhe Xiong, Zijia Lin, Jianwei Niu, Shasha Mo, Hui Chen, Peng Liu, Guiguang Ding

A code completion system suggests future code elements to developers given a partially-complete code snippet. Code completion is one of the most useful features in Integrated Development Environments (IDEs). Currently, most code completion…

Software Engineering · Computer Science 2020-09-21 Wenhan Wang, Sijie Shen, Ge Li, Zhi Jin

Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge in various natural language and speech processing tasks. Recent research highlights the dependency of BPE subword…

Computation and Language · Computer Science 2024-01-30 Ahnaf Mozib Samin

Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models, symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers…

Computation and Language · Computer Science 2026-02-16 Saumitra Yadav, Manish Shrivastava

Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities. We introduce a novel image tokenizer that bridges this gap by…

Artificial Intelligence · Computer Science 2025-03-11 Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu

The success of pretrained transformer language models (LMs) in natural language processing has led to a wide range of pretraining setups. In particular, these models employ a variety of subword tokenization methods, most notably byte-pair…

Computation and Language · Computer Science 2020-10-06 Kaj Bostrom, Greg Durrett

Standard Byte-Pair Encoding (BPE) tokenization compresses text by pairing a learned token vocabulary with a detailed merge list. Recent work has shown that this merge list exposes a potential attack surface for extracting information about…

Computation and Language · Computer Science 2025-08-12 Tomohiro Sawada, Kartik Goyal

Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization,…

Computation and Language · Computer Science 2025-06-02 Sander Land, Catherine Arnett

Byte Pair Encoding (BPE) serves as a foundation method for text tokenization in the Natural Language Processing (NLP) field. Despite its wide adoption, the original BPE algorithm harbors an inherent flaw: it inadvertently introduces a…

Computation and Language · Computer Science 2024-11-14 Haoran Lian, Yizhe Xiong, Jianwei Niu, Shasha Mo, Zhenpeng Su, Zijia Lin, Hui Chen, Peng Liu, Jungong Han, Guiguang Ding

The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference,…

Computation and Language · Computer Science 2024-12-03 Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith

NMT systems have problems with large vocabulary sizes. Byte-pair encoding (BPE) is a popular approach to solving this problem, but while BPE allows the system to generate any target-side word, it does not enable effective generalization…

Computation and Language · Computer Science 2017-09-06 Aleš Tamchyna, Marion Weller-Di Marco, Alexander Fraser