Related papers: Batching BPE Tokenization Merges

From Smør-re-brød to Subwords: Training LLMs on Danish, One Morpheme at a Time

The best performing transformer-based language models use subword tokenization techniques, such as Byte-Pair-Encoding (BPE). However, these approaches often overlook linguistic principles, such as morphological segmentation, which we…

Computation and Language · Computer Science 2025-04-03 Mikkel Wildner Kildeberg, Emil Allerslev Schledermann, Nicolaj Larsen, Rob van der Goot

CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons…

Computation and Language · Computer Science 2022-05-19 Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting

Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models

Large language models route every input through a learned embedding table of shape |V| x d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic…

Computation and Language · Computer Science 2026-05-29 Rohan Shravan

Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the…

Computation and Language · Computer Science 2026-02-04 Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith

BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning

Ensembles, where multiple neural networks are trained individually and their predictions are averaged, have been shown to be widely successful for improving both the accuracy and predictive uncertainty of single neural networks. However, an…

Machine Learning · Computer Science 2020-02-21 Yeming Wen, Dustin Tran, Jimmy Ba

Language Models over Canonical Byte-Pair Encodings

Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at…

Computation and Language · Computer Science 2025-06-10 Tim Vieira, Tianyu Liu, Clemente Pasti, Yahya Emara, Brian DuSell, Benjamin LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Timothy J. O'Donnell, Ryan Cotterell

From Words to Music: A Study of Subword Tokenization Techniques in Symbolic Music Generation

Subword tokenization has been widely successful in text-based natural language processing (NLP) tasks with Transformer-based models. As Transformer models become increasingly popular in symbolic music-related studies, it is imperative to…

Sound · Computer Science 2023-04-26 Adarsh Kumar, Pedro Sarmento

Retrofitting Large Language Models with Dynamic Tokenization

Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the…

Computation and Language · Computer Science 2025-06-12 Darius Feher, Ivan Vulić, Benjamin Minixhofer

Getting the most out of your tokenizer for pre-training and domain adaptation

Tokenization is an understudied and often neglected component of modern LLMs. Most published works use a single tokenizer for all experiments, often borrowed from another model, without performing ablations or analysis to optimize…

Computation and Language · Computer Science 2024-02-08 Gautier Dagan, Gabriel Synnaeve, Baptiste Rozière

Efficient numeracy in language models through single-token number embeddings

To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or…

Machine Learning · Computer Science 2026-05-21 Linus Kreitner, Paul Hager, Jonathan Mengedoht, Georgios Kaissis, Daniel Rueckert, Martin J. Menten

Improving the performance of bagging ensembles for data streams through mini-batching

Often, machine learning applications have to cope with dynamic environments where data are collected in the form of continuous data streams with potentially infinite length and transient behavior. Compared to traditional (batch) data…

Machine Learning · Computer Science 2021-12-21 Guilherme Cassales, Heitor Gomes, Albert Bifet, Bernhard Pfahringer, Hermes Senger

Bunched LPCNet : Vocoder for Low-cost Neural Text-To-Speech Systems

LPCNet is an efficient vocoder that combines linear prediction and deep neural network modules to keep the computational complexity low. In this work, we present two techniques to further reduce its complexity, aiming for a low-cost LPCNet…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-12 Ravichander Vipperla, Sangjun Park, Kihyun Choo, Samin Ishtiaq, Kyoungbo Min, Sourav Bhattacharya, Abhinav Mehrotra, Alberto Gil C. P. Ramos, Nicholas D. Lane

Selecting Two-Bit Bit Flipping Algorithms for Collective Error Correction

A class of two-bit bit flipping algorithms for decoding low-density parity-check codes over the binary symmetric channel was proposed in [1]. Initial results showed that decoders which employ a group of these algorithms operating in…

Information Theory · Computer Science 2012-05-22 Dung Viet Nguyen, Bane Vasic, Michael W. Marcellin

OnPair: Short Strings Compression for Fast Random Access

We present OnPair, a dictionary-based compression algorithm designed to meet the needs of in-memory database systems that require both high compression and fast random access. Existing methods either achieve strong compression ratios at…

Databases · Computer Science 2025-08-05 Francesco Gargiulo, Rossano Venturini

Efficient near-optimal decoding of the surface code through ensembling

We introduce harmonization, an ensembling method that combines several "noisy" decoders to generate highly accurate decoding predictions. Harmonized ensembles of MWPM-based decoders achieve lower logical error rates than their individual…

Quantum Physics · Physics 2024-03-18 Noah Shutty, Michael Newman, Benjamin Villalonga

Optimizing Segmentation Granularity for Neural Machine Translation

In neural machine translation (NMT), it has become standard to translate using subword units to allow for an open vocabulary and improve accuracy on infrequent words. Byte-pair encoding (BPE) and its variants are the predominant approach…

Computation and Language · Computer Science 2018-10-23 Elizabeth Salesky, Andrew Runge, Alex Coda, Jan Niehues, Graham Neubig

Quantum Ensemble for Classification

A powerful way to improve performance in machine learning is to construct an ensemble that combines the predictions of multiple models. Ensemble methods are often much more accurate and lower variance than the individual classifiers that…

Machine Learning · Computer Science 2024-12-03 Antonio Macaluso, Luca Clissa, Stefano Lodi, Claudio Sartori

Assemblage: Automatic Binary Dataset Construction for Machine Learning

Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpora of malicious binaries, obtaining high-quality corpora of…

Cryptography and Security · Computer Science 2024-11-05 Chang Liu, Rebecca Saul, Yihao Sun, Edward Raff, Maya Fuchs, Townsend Southard Pantano, James Holt, Kristopher Micinski

DNAMotifTokenizer: Towards Biologically Informed Tokenization of Genomic Sequences

DNA language models have advanced genomics, but their downstream performance varies widely due to differences in tokenization, pretraining data, and architecture. We argue that a major bottleneck lies in tokenizing sparse and unevenly…

Genomics · Quantitative Biology 2025-12-23 Xiaoxiao Zhou, Zihan Wang, Jingbo Shang, Yang E. Li

Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond

We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive…

Computation and Language · Computer Science 2023-11-14 Siyang Liu, Naihao Deng, Sahand Sabour, Yilin Jia, Minlie Huang, Rada Mihalcea