Related papers: Tokenization with Split Trees

A Triadic Suffix Tokenization Scheme for Numerical Reasoning

Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure, a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic…

Computation and Language · Computer Science 2026-04-22 Olga Chetverina

When Every Token Counts: Optimal Segmentation for Low-Resource Language Models

Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding…

Computation and Language · Computer Science 2025-05-05 Bharath Raj, Garvit Suri, Vikrant Dewangan, Raghav Sonavane

TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation

Current referring expression comprehension algorithms can effectively detect or segment objects indicated by nouns, but how to understand verb reference is still under-explored. As such, we study the challenging problem of task oriented…

Computer Vision and Pattern Recognition · Computer Science 2022-10-20 Pengfei Li, Beiwen Tian, Yongliang Shi, Xiaoxue Chen, Hao Zhao, Guyue Zhou, Ya-Qin Zhang

Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing

This paper proposes a method to optimize tokenization for the performance improvement of already trained downstream models. Our method generates tokenization results attaining lower loss values of a given downstream model on the training…

Computation and Language · Computer Science 2023-04-24 Tatsuya Hiraoka, Tomoya Iwakura

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT

Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as…

Computer Vision and Pattern Recognition · Computer Science 2026-02-19 Hyunchan Moon, Cheonjun Park, Steven L. Waslander

Rethinking Tokenization: Crafting Better Tokenizers for Large Language Models

Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from word-level to subword-level, analyzing how they balance tokens and types to enhance model adaptability while…

Computation and Language · Computer Science 2024-03-04 Jinbiao Yang

Unsupervised Morphological Tree Tokenizer

As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic…

Computation and Language · Computer Science 2025-07-11 Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu

TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition

We present TokenSplit, a speech separation model that acts on discrete token sequences. The model is trained on multiple tasks simultaneously: separate and transcribe each speech source, and generate speech from text. The model operates on…

Sound · Computer Science 2023-08-22 Hakan Erdogan, Scott Wisdom, Xuankai Chang, Zalán Borsos, Marco Tagliasacchi, Neil Zeghidour, John R. Hershey

A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning

Subword tokenization is a commonly used input pre-processing step in most recent NLP models. However, it limits the models' ability to leverage end-to-end task learning. Its frequency-based vocabulary creation compromises tokenization in…

Computation and Language · Computer Science 2022-04-25 Md Mofijul Islam, Gustavo Aguilar, Pragaash Ponnusamy, Clint Solomon Mathialagan, Chengyuan Ma, Chenlei Guo

Say More with Less: Variable-Frame-Rate Speech Tokenization via Adaptive Clustering and Implicit Duration Coding

Existing speech tokenizers typically assign a fixed number of tokens per second, regardless of the varying information density or temporal fluctuations in the speech signal. This uniform token allocation mismatches the intrinsic structure…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-14 Rui-Chen Zheng, Wenrui Liu, Hui-Peng Du, Qinglin Zhang, Chong Deng, Qian Chen, Wen Wang, Yang Ai, Zhen-Hua Ling

Fast WordPiece Tokenization

Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we propose efficient algorithms for the WordPiece tokenization used in BERT, from single-word tokenization to general text (e.g., sentence)…

Computation and Language · Computer Science 2021-10-07 Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou

Efficient Pre-Training with Token Superposition

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training…

Computation and Language · Computer Science 2026-05-20 Bowen Peng, Théo Gigant, Jeffrey Quesnelle

Neural Token Segmentation for High Token-Internal Complexity

Tokenizing raw texts into word units is an essential pre-processing step for critical tasks in the NLP pipeline such as tagging, parsing, named entity recognition, and more. For most languages, this tokenization step is straightforward.…

Computation and Language · Computer Science 2022-03-22 Idan Brusilovsky, Reut Tsarfaty

Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing…

Computation and Language · Computer Science 2025-04-15 Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

ByteSpan: Information-Driven Subword Tokenisation

Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an…

Computation and Language · Computer Science 2025-06-24 Zébulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery

CoST: Contrastive Quantization based Semantic Tokenization for Generative Recommendation

Embedding-based retrieval serves as a dominant approach to candidate item matching for industrial recommender systems. With the success of generative AI, generative retrieval has recently emerged as a new retrieval paradigm for…

Information Retrieval · Computer Science 2024-09-10 Jieming Zhu, Mengqun Jin, Qijiong Liu, Zexuan Qiu, Zhenhua Dong, Xiu Li

StochasTok: Improving Fine-Grained Subword Understanding in LLMs

Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still struggle…

Computation and Language · Computer Science 2026-04-22 Anya Sims, Thom Foster, Klara Kaleb, Tuan-Duy H. Nguyen, Joseph Lee, Jakob N. Foerster, Yee Whye Teh, Cong Lu

Comparative analysis of subword tokenization approaches for Indian languages

Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words…

Computation and Language · Computer Science 2025-05-23 Sudhansu Bala Das, Samujjal Choudhury, Tapas Kumar Mishra, Bidyut Kr. Patra