Related papers: Batching BPE Tokenization Merges

Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment

The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for…

Computation and Language · Computer Science 2025-11-11 Saketh Reddy Vemula, Sandipan Dandapat, Dipti Misra Sharma, Parameswari Krishnamurthy

Tokenization with Factorized Subword Encoding

In recent years, language models have become increasingly larger and more complex. However, the input representations for these models continue to rely on simple and greedy subword tokenization methods. In this paper, we propose a novel…

Computation and Language · Computer Science 2023-06-14 David Samuel, Lilja Øvrelid

Exploring Tokenization Strategies and Vocabulary Sizes for Enhanced Arabic Language Models

This paper presents a comprehensive examination of the impact of tokenization strategies and vocabulary sizes on the performance of Arabic language models in downstream natural language processing tasks. Our investigation focused on the…

Computation and Language · Computer Science 2024-09-23 Mohamed Taher Alrefaie, Nour Eldin Morsy, Nada Samir

Comparative analysis of subword tokenization approaches for Indian languages

Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words…

Computation and Language · Computer Science 2025-05-23 Sudhansu Bala Das, Samujjal Choudhury, Tapas Kumar Mishra, Bidyut Kr. Patra

Tokenization Constraints in LLMs: A Study of Symbolic and Arithmetic Reasoning Limits

Tokenization is the first - and often underappreciated - layer of computation in language models. While Chain-of-Thought (CoT) prompting enables transformer models to approximate recurrent computation by externalizing intermediate steps, we…

Computation and Language · Computer Science 2025-05-21 Xiang Zhang, Juntai Cao, Jiaqi Wei, Yiwei Xu, Chenyu You

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

FreeMesh: Boosting Mesh Generation with Coordinates Merging

The next-coordinate prediction paradigm has emerged as the de facto standard in current auto-regressive mesh generation methods. Despite their effectiveness, there is no efficient measurement for the various tokenizers that serialize meshes…

Graphics · Computer Science 2025-05-21 Jian Liu, Haohan Weng, Biwen Lei, Xianghui Yang, Zibo Zhao, Zhuo Chen, Song Guo, Tao Han, Chunchao Guo

How BPE Affects Memorization in Transformers

Training data memorization in NLP can both be beneficial (e.g., closed-book QA) and undesirable (personal data extraction). In any case, successful model training requires a non-trivial amount of memorization to store word spellings,…

Computation and Language · Computer Science 2021-12-03 Eugene Kharitonov, Marco Baroni, Dieuwke Hupkes

On the Effect of Token Merging on Pre-trained Models for Code

Tokenization is a fundamental component of language models for code. It involves breaking down the input into units that are later passed to the language model stack to learn high-dimensional representations used in various contexts, from…

Software Engineering · Computer Science 2025-07-22 Mootez Saad, Hao Li, Tushar Sharma, Ahmed E. Hassan

Using Multiple Subwords to Improve English-Esperanto Automated Literary Translation Quality

Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those seen with…

Computation and Language · Computer Science 2020-12-01 Alberto Poncelas, Jan Buts, James Hadley, Andy Way

Training Multilingual Pre-trained Language Model with Byte-level Subwords

The pre-trained language models have achieved great successes in various natural language understanding (NLU) tasks due to its capacity to capture the deep contextualized information in text by pre-training on large-scale corpora. One of…

Computation and Language · Computer Science 2021-06-04 Junqiu Wei, Qun Liu, Yinpeng Guo, Xin Jiang

Unpacking Tokenization: Evaluating Text Compression and its Correlation with Model Performance

Despite it being the cornerstone of BPE, the most common tokenization algorithm, the importance of compression in the tokenization process is still unclear. In this paper, we argue for the theoretical importance of compression, that can be…

Computation and Language · Computer Science 2024-06-25 Omer Goldman, Avi Caciularu, Matan Eyal, Kris Cao, Idan Szpektor, Reut Tsarfaty

Bayes Merging of Multiple Vocabularies for Scalable Image Retrieval

The Bag-of-Words (BoW) representation is well applied to recent state-of-the-art image retrieval works. Typically, multiple vocabularies are generated to correct quantization artifacts and improve recall. However, this routine is corrupted…

Computer Vision and Pattern Recognition · Computer Science 2014-04-15 Liang Zheng, Shengjin Wang, Wengang Zhou, Qi Tian

Practical Inner Codes for Batched Sparse Codes in Wireless Multihop Networks

Batched sparse (BATS) code is a promising technology for reliable data transmission in multi-hop wireless networks. As a BATS code consists of an outer code and an inner code that typically is a random linear network code, one main research…

Information Theory · Computer Science 2017-09-05 Zhiheng Zhou, Congduan Li, Shenghao Yang, Xuan Guang

Which Pieces Does Unigram Tokenization Really Need?

The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece…

Computation and Language · Computer Science 2026-04-13 Sander Land, Yuval Pinter

ECG-Byte: A Tokenizer for End-to-End Generative Electrocardiogram Language Modeling

Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channeled ECG signals and corresponding…

Computation and Language · Computer Science 2025-07-31 William Han, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao

M-Ped: Multi-Prompt Ensemble Decoding for Large Language Models

With the widespread application of Large Language Models (LLMs) in the field of Natural Language Processing (NLP), enhancing their performance has become a research hotspot. This paper presents a novel multi-prompt ensemble decoding…

Computation and Language · Computer Science 2024-12-25 Jiaxin Guo, Daimeng Wei, Yuanchang Luo, Shimin Tao, Hengchao Shang, Zongyao Li, Shaojun Li, Jinlong Yang, Zhanglin Wu, Zhiqiang Rao, Hao Yang

Revisiting Regex Generation for Modeling Industrial Applications by Incorporating Byte Pair Encoder

Regular expression is important for many natural language processing tasks especially when used to deal with unstructured and semi-structured data. This work focuses on automatically generating regular expressions and proposes a novel…

Neural and Evolutionary Computing · Computer Science 2020-06-25 Desheng Wang, Jiawei Liu, Xiang Qi, Baolin Sun, Peng Zhang

Rethinking LLM Ensembling from the Perspective of Mixture Models

Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea…

Machine Learning · Computer Science 2026-05-26 Jiale Fu, Yuchu Jiang, Peijun Wu, Chonghan Liu, Joey Tianyi Zhou, Xu Yang

Fast WordPiece Tokenization

Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we propose efficient algorithms for the WordPiece tokenization used in BERT, from single-word tokenization to general text (e.g., sentence)…

Computation and Language · Computer Science 2021-10-07 Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou