Related papers: Batching BPE Tokenization Merges

Unified Multimodal Understanding via Byte-Pair Visual Encoding

Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-01 · Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

Finding Better Subword Segmentation for Neural Machine Translation

For different language pairs, word-level neural machine translation (NMT) models with a fixed-size vocabulary suffer from the same problem of representing out-of-vocabulary (OOV) words. The common practice usually replaces all these rare or…

Computation and Language · Computer Science · 2018-07-26 · Yingting Wu, Hai Zhao

Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation

This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units. We view the subword segmentation of output sentences as a latent variable that should be marginalized out…

Computation and Language · Computer Science · 2020-08-04 · Xuanli He, Gholamreza Haffari, Mohammad Norouzi

MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging

Modeling genomic sequences faces two unsolved challenges: the information density varies widely across different regions, while there is no clearly defined minimum vocabulary unit. Relying on either four primitive bases or independently…

Genomics · Quantitative Biology · 2025-11-20 · Siyuan Li, Kai Yu, Anna Wang, Zicheng Liu, Chang Yu, Jingbo Zhou, Qirong Yang, Yucheng Guo, Xiaoming Zhang, Stan Z. Li

Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation

Subword tokenization is a common method for vocabulary building in Neural Machine Translation (NMT) models. However, increasingly complex tasks have revealed its disadvantages. First, a vocabulary cannot be modified once it is learned,…

Computation and Language · Computer Science · 2024-08-13 · Langlin Huang, Yang Feng

Learning to Merge Tokens in Vision Transformers

Transformers are widely applied to solve natural language understanding and computer vision tasks. While scaling up these architectures leads to improved performance, it often comes at the expense of much higher computational costs. In…

Computer Vision and Pattern Recognition · Computer Science · 2022-02-25 · Cedric Renggli, André Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, Carlos Riquelme

Joint Optimization of Tokenization and Downstream Model

Since traditional tokenizers are isolated from a downstream task and model, they cannot output an appropriate tokenization depending on the task and model, although recent studies imply that the appropriate tokenization improves the…

Computation and Language · Computer Science · 2021-05-27 · Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki

Semantic Tokenizer for Enhanced Natural Language Processing

Traditionally, NLP performance improvement has been focused on improving models and increasing the number of model parameters. NLP vocabulary construction has remained focused on maximizing the number of words represented through subword…

Computation and Language · Computer Science · 2023-04-26 · Sandeep Mehta, Darpan Shah, Ravindra Kulkarni, Cornelia Caragea

Tokenization with Split Trees

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed…

Computation and Language · Computer Science · 2026-05-28 · Craig W. Schmidt, Michael Krumdick, Adam Wiemerslage, Seth Ebner, Varshini Reddy, Yuval Pinter, Chris Tanner

Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing…

Computation and Language · Computer Science · 2025-04-15 · Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich

An investigation of phone-based subword units for end-to-end speech recognition

Phones and their context-dependent variants have been the standard modeling units for conventional speech recognition systems, while characters and subwords have demonstrated their effectiveness for end-to-end recognition systems. We…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-06-23 · Weiran Wang, Guangsen Wang, Aadyot Bhatnagar, Yingbo Zhou, Caiming Xiong, Richard Socher

Egalitarian Language Representation in Language Models: It All Begins with Tokenizers

Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Due to the immense popularity of English-Centric Large Language Models (LLMs), efforts are…

Computation and Language · Computer Science · 2025-01-22 · Menan Velayuthan, Kengatharaiyer Sarveswaran

Tokenisation over Bounded Alphabets is Hard

Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets -- an unrealistic assumption, given that in practice tokenisers operate over fixed-size…

Computation and Language · Computer Science · 2025-11-20 · Violeta Kastreva, Philip Whittington, Dennis Komm, Tiago Pimentel

GPUTOK: GPU Accelerated Byte Level BPE Tokenization

As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that…

Computation and Language · Computer Science · 2026-03-04 · Venu Gopal Kadamba, Kanishkha Jaisankar

Length-MAX Tokenizer for Language Models

We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we…

Computation and Language · Computer Science · 2025-11-27 · Dong Dong, Weijie Su

CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation

Code generation is a longstanding challenge, aiming to generate a code snippet based on a natural language description. Usually, expensive text-code paired data is essential for training a code generation model. Recently, thanks to the…

Software Engineering · Computer Science · 2022-06-15 · Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, Jian-Guang Lou

MIPE: A Metric Independent Pipeline for Effective Code-Mixed NLG Evaluation

Code-mixing is a phenomenon of mixing words and phrases from two or more languages in a single utterance of speech and text. Due to the high linguistic diversity, code-mixing presents several challenges in evaluating standard natural…

Computation and Language · Computer Science · 2021-07-27 · Ayush Garg, Sammed S Kagi, Vivek Srivastava, Mayank Singh

Overcoming Vocabulary Constraints with Pixel-level Fallback

Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models…

Computation and Language · Computer Science · 2025-08-12 · Jonas F. Lotz, Hendra Setiawan, Stephan Peitz, Yova Kementchedjhieva

Efficient Sequence Packing without Cross-contamination: Accelerating Large Language Models without Impacting Performance

Effective training of today's large language models (LLMs) depends on large batches and long sequences for throughput and accuracy. To handle variable-length sequences on hardware accelerators, it is common practice to introduce padding…

Computation and Language · Computer Science · 2022-10-07 · Mario Michael Krell, Matej Kosec, Sergio P. Perez, Andrew Fitzgibbon

Synthetic Data Generation and Joint Learning for Robust Code-Mixed Translation

The widespread online communication in a modern multilingual world has provided opportunities to blend more than one language (aka code-mixed language) in a single utterance. This has resulted in a formidable challenge for the computational…

Computation and Language · Computer Science · 2024-05-01 · Kartik Kartik, Sanjana Soni, Anoop Kunchukuttan, Tanmoy Chakraborty, Md Shad Akhtar