Related papers: Batching BPE Tokenization Merges

Tuple Packing: Efficient Batching of Small Graphs in Graph Neural Networks

When processing a batch of graphs in machine learning models such as Graph Neural Networks (GNN), it is common to combine several small graphs into one overall graph to accelerate processing and remove or reduce the overhead of padding.…

Machine Learning · Computer Science 2022-09-20 Mario Michael Krell, Manuel Lopez, Sreenidhi Anand, Hatem Helal, Andrew William Fitzgibbon

Transfer Learning across Low-Resource, Related Languages for Neural Machine Translation

We present a simple method to improve neural translation of a low-resource language pair using parallel data from a related, also low-resource, language pair. The method is based on the transfer method of Zoph et al., but whereas their…

Computation and Language · Computer Science 2017-09-22 Toan Q. Nguyen, David Chiang

Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish

Tokenization plays a critical role in processing agglutinative languages, where a single word can encode multiple morphemes carrying syntactic and semantic information. This study evaluates the impact of various tokenization strategies -…

Computation and Language · Computer Science 2025-09-30 Jinfan Frank Hu

Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge

Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over…

Computation and Language · Computer Science 2026-01-14 Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, Jiatao Gu

BPE and CharCNNs for Translation of Morphology: A Cross-Lingual Comparison and Analysis

Neural Machine Translation (NMT) in low-resource settings and of morphologically rich languages is made difficult in part by data sparsity of vocabulary words. Several methods have been used to help reduce this sparsity, notably Byte-Pair…

Computation and Language · Computer Science 2018-09-11 Pamela Shapiro, Kevin Duh

Efficient Quantization of Mixture-of-Experts with Theoretical Generalization Guarantees

Sparse Mixture-of-Experts (MoE) allows scaling of language and vision models efficiently by activating only a small subset of experts per input. While this reduces computation, the large number of parameters still incurs substantial memory…

Machine Learning · Computer Science 2026-04-09 Mohammed Nowaz Rabbani Chowdhury, Kaoutar El Maghraoui, Hsinyu Tsai, Naigang Wang, Geoffrey W. Burr, Liu Liu, Meng Wang

VerChol -- Grammar-First Tokenization for Agglutinative Languages

Tokenization is the foundational step in all large language model (LLM) pipelines, yet the dominant approach, Byte Pair Encoding (BPE) and its variants, is inherently script-agnostic and optimized for English-like morphology. For…

Computation and Language · Computer Science 2026-03-09 Prabhu Raja

General Purpose Text Embeddings from Pre-trained Language Models for Scalable Inference

The state of the art on many NLP tasks is currently achieved by large pre-trained language models, which require a considerable amount of computation. We explore a setting where many different predictions are made on a single piece of text.…

Computation and Language · Computer Science 2020-04-30 Jingfei Du, Myle Ott, Haoran Li, Xing Zhou, Veselin Stoyanov

Asynchronous Training of Word Embeddings for Large Text Corpora

Word embeddings are a powerful approach for analyzing language and have been widely popular in numerous tasks in information retrieval and text mining. Training embeddings over huge corpora is computationally expensive because the input is…

Machine Learning · Computer Science 2018-12-11 Avishek Anand, Megha Khosla, Jaspreet Singh, Jan-Hendrik Zab, Zijian Zhang

SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation

Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE); however, they are inefficient as they require…

Computation and Language · Computer Science 2023-08-01 Haiyue Song, Raj Dabre, Chenhui Chu, Sadao Kurohashi, Eiichiro Sumita

Frequency-Ordered Tokenization for Better Text Compression

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with…

Information Theory · Computer Science 2026-02-27 Maximilian Kalcher

A Systematic Analysis of Vocabulary and BPE Settings for Optimal Fine-tuning of NMT: A Case Study of In-domain Translation

The effectiveness of Neural Machine Translation (NMT) models largely depends on the vocabulary used at training; small vocabularies can lead to out-of-vocabulary problems -- large ones, to memory issues. Subword (SW) tokenization has been…

Computation and Language · Computer Science 2023-03-02 J. Pourmostafa Roshan Sharami, D. Shterionov, P. Spronck

Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights

Tokenization plays a pivotal role in multilingual NLP. However, existing tokenizers are often skewed towards high-resource languages, limiting their effectiveness for linguistically diverse and morphologically rich languages such as those…

Computation and Language · Computer Science 2025-06-25 N J Karthika, Maharaj Brahma, Rohit Saluja, Ganesh Ramakrishnan, Maunendra Sankar Desarkar

Merge Path - A Visually Intuitive Approach to Parallel Merging

Merging two sorted arrays is a prominent building block for sorting and other functions. Its efficient parallelization requires balancing the load among compute cores, minimizing the extra work brought about by parallelization, and…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-06-23 Oded Green, Saher Odeh, Yitzhak Birk

Peek2: Regex-free Byte-level Byte-Pair Encoding Pretokenizer for LLM Inference on Edge Devices

Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers, yet little work has been done to optimize it for edge-side inference. Our proposed new implementation, Peek2, serves as a drop-in replacement for cl100k-like…

Computation and Language · Computer Science 2026-05-04 Liu Zai, Iraklis Klampanos

Improving Low Compute Language Modeling with In-Domain Embedding Initialisation

Many NLP applications, such as biomedical data and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most…

Computation and Language · Computer Science 2020-10-01 Charles Welch, Rada Mihalcea, Jonathan K. Kummerfeld

End-to-End Long Document Summarization using Gradient Caching

Training transformer-based encoder-decoder models for long document summarization poses a significant challenge due to the quadratic memory consumption during training. Several approaches have been proposed to extend the input length at…

Computation and Language · Computer Science 2025-06-30 Rohit Saxena, Hao Tang, Frank Keller

Approaches to the Parallelization of Merge Sort in Python

The theory of divide-and-conquer parallelization has been well-studied in the past, providing a solid basis upon which to explore different approaches to the parallelization of merge sort in Python. Python's simplicity and extensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-30 Alexandra Yang

Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1)…

Computation and Language · Computer Science 2025-09-25 Gagan Bhatia, Maxime Peyrard, Wei Zhao

Training on Synthetic Noise Improves Robustness to Natural Noise in Machine Translation

We consider the problem of making machine translation more robust to character-level variation at the source side, such as typos. Existing methods achieve greater coverage by applying subword models such as byte-pair encoding (BPE) and…

Computation and Language · Computer Science 2019-02-06 Vladimir Karpukhin, Omer Levy, Jacob Eisenstein, Marjan Ghazvininejad