Related papers: Batching BPE Tokenization Merges

200 papers

The best performing transformer-based language models use subword tokenization techniques, such as Byte-Pair-Encoding (BPE). However, these approaches often overlook linguistic principles, such as morphological segmentation, which we…

Computation and Language · Computer Science 2025-04-03 Mikkel Wildner Kildeberg, Emil Allerslev Schledermann, Nicolaj Larsen, Rob van der Goot

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons…

Computation and Language · Computer Science 2022-05-19 Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting

Large language models route every input through a learned embedding table of shape |V| x d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic…

Computation and Language · Computer Science 2026-05-29 Rohan Shravan

Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the…

Computation and Language · Computer Science 2026-02-04 Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith

Ensembles, where multiple neural networks are trained individually and their predictions are averaged, have been shown to be widely successful for improving both the accuracy and predictive uncertainty of single neural networks. However, an…

Machine Learning · Computer Science 2020-02-21 Yeming Wen, Dustin Tran, Jimmy Ba

Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at…

Subword tokenization has been widely successful in text-based natural language processing (NLP) tasks with Transformer-based models. As Transformer models become increasingly popular in symbolic music-related studies, it is imperative to…

Sound · Computer Science 2023-04-26 Adarsh Kumar, Pedro Sarmento

Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the…

Computation and Language · Computer Science 2025-06-12 Darius Feher, Ivan Vulić, Benjamin Minixhofer

Tokenization is an understudied and often neglected component of modern LLMs. Most published works use a single tokenizer for all experiments, often borrowed from another model, without performing ablations or analysis to optimize…

Computation and Language · Computer Science 2024-02-08 Gautier Dagan, Gabriel Synnaeve, Baptiste Rozière

To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or…

Machine Learning · Computer Science 2026-05-21 Linus Kreitner, Paul Hager, Jonathan Mengedoht, Georgios Kaissis, Daniel Rueckert, Martin J. Menten

Often, machine learning applications have to cope with dynamic environments where data are collected in the form of continuous data streams with potentially infinite length and transient behavior. Compared to traditional (batch) data…

Machine Learning · Computer Science 2021-12-21 Guilherme Cassales, Heitor Gomes, Albert Bifet, Bernhard Pfahringer, Hermes Senger

LPCNet is an efficient vocoder that combines linear prediction and deep neural network modules to keep the computational complexity low. In this work, we present two techniques to further reduce its complexity, aiming for a low-cost LPCNet…

A class of two-bit bit flipping algorithms for decoding low-density parity-check codes over the binary symmetric channel was proposed in [1]. Initial results showed that decoders which employ a group of these algorithms operating in…

Information Theory · Computer Science 2012-05-22 Dung Viet Nguyen, Bane Vasic, Michael W. Marcellin

We present OnPair, a dictionary-based compression algorithm designed to meet the needs of in-memory database systems that require both high compression and fast random access. Existing methods either achieve strong compression ratios at…

Databases · Computer Science 2025-08-05 Francesco Gargiulo, Rossano Venturini

We introduce harmonization, an ensembling method that combines several "noisy" decoders to generate highly accurate decoding predictions. Harmonized ensembles of MWPM-based decoders achieve lower logical error rates than their individual…

Quantum Physics · Physics 2024-03-18 Noah Shutty, Michael Newman, Benjamin Villalonga

In neural machine translation (NMT), it has become standard to translate using subword units to allow for an open vocabulary and improve accuracy on infrequent words. Byte-pair encoding (BPE) and its variants are the predominant approach…

Computation and Language · Computer Science 2018-10-23 Elizabeth Salesky, Andrew Runge, Alex Coda, Jan Niehues, Graham Neubig

A powerful way to improve performance in machine learning is to construct an ensemble that combines the predictions of multiple models. Ensemble methods are often much more accurate and lower variance than the individual classifiers that…

Machine Learning · Computer Science 2024-12-03 Antonio Macaluso, Luca Clissa, Stefano Lodi, Claudio Sartori

Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpora of malicious binaries, obtaining high-quality corpora of…

Cryptography and Security · Computer Science 2024-11-05 Chang Liu, Rebecca Saul, Yihao Sun, Edward Raff, Maya Fuchs, Townsend Southard Pantano, James Holt, Kristopher Micinski

DNA language models have advanced genomics, but their downstream performance varies widely due to differences in tokenization, pretraining data, and architecture. We argue that a major bottleneck lies in tokenizing sparse and unevenly…

Genomics · Quantitative Biology 2025-12-23 Xiaoxiao Zhou, Zihan Wang, Jingbo Shang, Yang E. Li

We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive…

Computation and Language · Computer Science 2023-11-14 Siyang Liu, Naihao Deng, Sahand Sabour, Yilin Jia, Minlie Huang, Rada Mihalcea