Related papers: Batching BPE Tokenization Merges

200 papers

When processing a batch of graphs in machine learning models such as Graph Neural Networks (GNN), it is common to combine several small graphs into one overall graph to accelerate processing and remove or reduce the overhead of padding.…

Machine Learning · Computer Science 2022-09-20 Mario Michael Krell, Manuel Lopez, Sreenidhi Anand, Hatem Helal, Andrew William Fitzgibbon

We present a simple method to improve neural translation of a low-resource language pair using parallel data from a related, also low-resource, language pair. The method is based on the transfer method of Zoph et al., but whereas their…

Computation and Language · Computer Science 2017-09-22 Toan Q. Nguyen, David Chiang

Tokenization plays a critical role in processing agglutinative languages, where a single word can encode multiple morphemes carrying syntactic and semantic information. This study evaluates the impact of various tokenization strategies -…

Computation and Language · Computer Science 2025-09-30 Jinfan Frank Hu

Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over…

Computation and Language · Computer Science 2026-01-14 Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, Jiatao Gu

Neural Machine Translation (NMT) in low-resource settings and of morphologically rich languages is made difficult in part by data sparsity of vocabulary words. Several methods have been used to help reduce this sparsity, notably Byte-Pair…

Computation and Language · Computer Science 2018-09-11 Pamela Shapiro, Kevin Duh

Sparse Mixture-of-Experts (MoE) allows scaling of language and vision models efficiently by activating only a small subset of experts per input. While this reduces computation, the large number of parameters still incurs substantial memory…

Tokenization is the foundational step in all large language model (LLM) pipelines, yet the dominant approach, Byte Pair Encoding (BPE) and its variants, is inherently script-agnostic and optimized for English-like morphology. For…

Computation and Language · Computer Science 2026-03-09 Prabhu Raja

The state of the art on many NLP tasks is currently achieved by large pre-trained language models, which require a considerable amount of computation. We explore a setting where many different predictions are made on a single piece of text.…

Computation and Language · Computer Science 2020-04-30 Jingfei Du, Myle Ott, Haoran Li, Xing Zhou, Veselin Stoyanov

Word embeddings are a powerful approach for analyzing language and have been widely popular in numerous tasks in information retrieval and text mining. Training embeddings over huge corpora is computationally expensive because the input is…

Machine Learning · Computer Science 2018-12-11 Avishek Anand, Megha Khosla, Jaspreet Singh, Jan-Hendrik Zab, Zijian Zhang

Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters are better than Byte-Pair Encoding (BPE); however, they are inefficient as they require…

Computation and Language · Computer Science 2023-08-01 Haiyue Song, Raj Dabre, Chenhui Chu, Sadao Kurohashi, Eiichiro Sumita

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with…

Information Theory · Computer Science 2026-02-27 Maximilian Kalcher

The effectiveness of Neural Machine Translation (NMT) models largely depends on the vocabulary used at training; small vocabularies can lead to out-of-vocabulary problems -- large ones, to memory issues. Subword (SW) tokenization has been…

Computation and Language · Computer Science 2023-03-02 J. Pourmostafa Roshan Sharami, D. Shterionov, P. Spronck

Tokenization plays a pivotal role in multilingual NLP. However, existing tokenizers are often skewed towards high-resource languages, limiting their effectiveness for linguistically diverse and morphologically rich languages such as those…

Computation and Language · Computer Science 2025-06-25 N J Karthika, Maharaj Brahma, Rohit Saluja, Ganesh Ramakrishnan, Maunendra Sankar Desarkar

Merging two sorted arrays is a prominent building block for sorting and other functions. Its efficient parallelization requires balancing the load among compute cores, minimizing the extra work brought about by parallelization, and…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-06-23 Oded Green, Saher Odeh, Yitzhak Birk

Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers, yet little work has been done to optimize it for edge-side inference. Our proposed new implementation, Peek2, serves as a drop-in replacement for cl100k-like…

Computation and Language · Computer Science 2026-05-04 Liu Zai, Iraklis Klampanos

Many NLP applications, such as biomedical data and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most…

Computation and Language · Computer Science 2020-10-01 Charles Welch, Rada Mihalcea, Jonathan K. Kummerfeld

Training transformer-based encoder-decoder models for long document summarization poses a significant challenge due to the quadratic memory consumption during training. Several approaches have been proposed to extend the input length at…

Computation and Language · Computer Science 2025-06-30 Rohit Saxena, Hao Tang, Frank Keller

The theory of divide-and-conquer parallelization has been well-studied in the past, providing a solid basis upon which to explore different approaches to the parallelization of merge sort in Python. Python's simplicity and extensive…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-11-30 Alexandra Yang

Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1)…

Computation and Language · Computer Science 2025-09-25 Gagan Bhatia, Maxime Peyrard, Wei Zhao

We consider the problem of making machine translation more robust to character-level variation at the source side, such as typos. Existing methods achieve greater coverage by applying subword models such as byte-pair encoding (BPE) and…

Computation and Language · Computer Science 2019-02-06 Vladimir Karpukhin, Omer Levy, Jacob Eisenstein, Marjan Ghazvininejad