Related papers: GPUTOK: GPU Accelerated Byte Level BPE Tokenizatio…

BlockBPE: Parallel BPE Tokenization

Tokenization is a critical preprocessing step in large language model pipelines, yet widely-used implementations remain CPU-bound and suboptimal for batch inference workflows on GPU. We present BlockBPE, a parallel GPU implementation of…

Computation and Language · Computer Science 2025-07-17 Amos You

DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models

Tokenization sits at the boundary between high-throughput genomic input and GPU compute, posing challenges in both algorithm design and system throughput. Overlapping k-mer tokenization can introduce information leakage under masked…

Genomics · Quantitative Biology 2026-01-12 Eliatan Niktab, Hardip Patel

Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE

Tokenization is an important first step in Natural Language Processing (NLP) pipelines because it decides how models learn and represent linguistic information. However, current subword tokenizers like SentencePiece or HuggingFace BPE are…

Computation and Language · Computer Science 2025-11-10 Firoj Ahmmed Patwary, Abdullah Al Noman

Tokenization Is More Than Compression

Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has…

Computation and Language · Computer Science 2024-10-08 Craig W. Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner

ScribeTokens: Fixed-Vocabulary Tokenization of Digital Ink

Digital ink -- the coordinate stream captured from stylus or touch input -- lacks a unified representation. Continuous vector representations produce long sequences and suffer from training instability, while existing token representations…

Computer Vision and Pattern Recognition · Computer Science 2026-03-04 Douglass Wang

Faster Superword Tokenization

Byte Pair Encoding (BPE) is a widely used tokenization algorithm, whose tokens cannot extend across pre-tokenization boundaries, functionally limiting it to representing at most full words. The BoundlessBPE and SuperBPE algorithms extend…

Computation and Language · Computer Science 2026-04-08 Craig W. Schmidt, Chris Tanner, Yuval Pinter

From Characters to Tokens: Dynamic Grouping with Hierarchical BPE

Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare…

Computation and Language · Computer Science 2025-10-20 Rares Dolga, Lucas Maystre, Tudor Berariu, David Barber

Binary BPE: A Family of Cross-Platform Tokenizers for Binary Analysis

Sequence models for binary analysis are bottlenecked by byte-level tokenization: raw bytes waste precious context window capacity for transformers and other neural network architectures, and many existing text-oriented tokenizers fail on…

Machine Learning · Computer Science 2025-11-25 Michael J. Bommarito

SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling

Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to…

Computation and Language · Computer Science 2025-08-22 Dong Liu, Yanxuan Yu

ThunderKittens: Simple, Fast, and Adorable AI Kernels

The challenge of mapping AI architectures to GPU hardware is creating a critical bottleneck in AI progress. Despite substantial efforts, hand-written custom kernels fail to meet their theoretical performance thresholds, even on…

Machine Learning · Computer Science 2024-10-29 Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, Christopher Ré

LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers

Tokenization is fundamental to how language models represent and process text, yet the behavior of widely used BPE tokenizers has received far less study than model architectures and training. In this paper, we investigate intermediate…

Computation and Language · Computer Science 2026-02-05 Yike Sun, Haotong Yang, Zhouchen Lin, Muhan Zhang

Length-MAX Tokenizer for Language Models

We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we…

Computation and Language · Computer Science 2025-11-27 Dong Dong, Weijie Su

AdaptBPE: From General Purpose to Specialized Tokenizers

Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly…

Computation and Language · Computer Science 2026-01-30 Vijini Liyanage, François Yvon

Batching BPE Tokenization Merges

The Byte Pair Encoding algorithm can be safely batched to merge hundreds of pairs of tokens at a time when building up a tokenizer's vocabulary. This technique combined with reducing the memory footprint of text used in vocabulary training…

Computation and Language · Computer Science 2024-08-12 Alexander P. Morgan

Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs

Despite foreseeing tremendous speedups over conventional deep neural networks, the performance advantage of binarized neural networks (BNNs) has merely been showcased on general-purpose processors such as CPUs and GPUs. In fact, due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-12-16 Ang Li, Simon Su

Safe and Practical GPU Acceleration in TrustZone

We present a holistic design for GPU-accelerated computation in TrustZone TEE. Without pulling the complex GPU software stack into the TEE, we follow a simple approach: record the CPU/GPU interactions ahead of time, and replay the…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-08 Heejin Park, Felix Xiaozhu Lin

Back to Bytes: Revisiting Tokenization Through UTF-8

We present UTF8Tokenizer, a minimalist byte-level tokenizer that maps text exactly to IDs corresponding to the bytes underlying the text's UTF-8 encoding (e.g., byte x09 is token ID 9). Unlike prior byte-level approaches (Xue et al., 2021;…

Computation and Language · Computer Science 2025-10-21 Amit Moryossef, Clara Meister, Pavel Stepachev, Desmond Elliott

Chunked TabPFN: Exact Training-Free In-Context Learning for Long-Context Tabular Data

TabPFN v2 achieves better results than tree-based models on several tabular benchmarks, which is notable since tree-based models are usually the strongest choice for tabular data. However, it cannot handle more than 10K context tokens…

Machine Learning · Computer Science 2025-09-18 Renat Sergazinov, Shao-An Yin

Peek2: Regex-free Byte-level Byte-Pair Encoding Pretokenizer for LLM Inference on Edge Devices

Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers, yet little work has been done to optimize it for edge-side inference. Our proposed new implementation, Peek2, serves as a drop-in replacement for cl100k-like…

Computation and Language · Computer Science 2026-05-04 Liu Zai, Iraklis Klampanos

SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance

Tokenization remains a fundamental yet underexplored bottleneck in natural language processing, with strategies largely static despite remarkable progress in model architectures. We present SupraTok, a novel tokenization architecture that…

Computation and Language · Computer Science 2025-08-26 Andrei-Valentin Tănase, Elena Pelican