Related papers: Batching BPE Tokenization Merges

MPM: Mutual Pair Merging for Efficient Vision Transformers

Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is…

Computer Vision and Pattern Recognition · Computer Science 2026-04-08 Simon Ravé, Pejman Rasti, David Rousseau

Segmenting Numerical Substitution Ciphers

Deciphering historical substitution ciphers is a challenging problem. Example problems that have been previously studied include detecting cipher type, detecting plaintext language, and acquiring the substitution key for segmented ciphers.…

Computation and Language · Computer Science 2022-05-26 Nada Aldarrab, Jonathan May

Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes

We present two end-to-end models: Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words or words as the unit of choice to model text. These…

Audio and Speech Processing · Electrical Eng. & Systems 2018-11-27 Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, William Chan

Batch Normalization with Enhanced Linear Transformation

Batch normalization (BN) is a fundamental unit in modern deep networks, in which a linear transformation module was designed for improving BN's flexibility of fitting complex data distributions. In this paper, we demonstrate properly…

Computer Vision and Pattern Recognition · Computer Science 2020-12-01 Yuhui Xu, Lingxi Xie, Cihang Xie, Jieru Mei, Siyuan Qiao, Wei Shen, Hongkai Xiong, Alan Yuille

Mixture Model Auto-Encoders: Deep Clustering through Dictionary Learning

State-of-the-art approaches for clustering high-dimensional data utilize deep auto-encoder architectures. Many of these networks require a large number of parameters and suffer from a lack of interpretability, due to the black-box nature of…

Machine Learning · Computer Science 2022-02-28 Alexander Lin, Andrew H. Song, Demba Ba

MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling

A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias…

Computation and Language · Computer Science 2024-11-12 Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer

A Practical Mixed Precision Algorithm for Post-Training Quantization

Neural network quantization is frequently used to optimize model size, latency and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer get…

Machine Learning · Computer Science 2023-02-13 Nilesh Prasad Pandey, Markus Nagel, Mart van Baalen, Yin Huang, Chirag Patel, Tijmen Blankevoort

MMTEB: Massive Multilingual Text Embedding Benchmark

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive…

Computation and Language · Computer Science 2025-11-14 Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poświata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Mariya Hendriksen, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Šuppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lù, Jordan Clive, Gayatri Krishnakumar, Anna Maksimova, Silvan Wehrli, Maria Tikhonova, Henil Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Safi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Hongjin Su, Jimmy Lin, Howard Yen, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, Niklas Muennighoff

Wine is Not v i n. -- On the Compatibility of Tokenizations Across Languages

The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are…

Computation and Language · Computer Science 2021-09-14 Antonis Maronikolakis, Philipp Dufter, Hinrich Schütze

Unsupervised Morphological Tree Tokenizer

As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic…

Computation and Language · Computer Science 2025-07-11 Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu

Morphological evaluation of subwords vocabulary used by BETO language model

Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always…

Computation and Language · Computer Science 2024-10-04 Óscar García-Sierra, Ana Fernández-Pampillón Cesteros, Miguel Ortega-Martín

Accelerating Transducers through Adjacent Token Merging

Recent end-to-end automatic speech recognition (ASR) systems often utilize a Transformer-based acoustic encoder that generates embedding at a high frame rate. However, this design is inefficient, particularly for long speech signals due to…

Computation and Language · Computer Science 2023-06-29 Yuang Li, Yu Wu, Jinyu Li, Shujie Liu

ByT5: Towards a token-free future with pre-trained byte-to-byte models

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can…

Computation and Language · Computer Science 2022-03-09 Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the…

Computer Vision and Pattern Recognition · Computer Science 2025-10-24 Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen

Splintering Nonconcatenative Languages for Better Tokenization

Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone. This is not true for languages such as Hebrew and Arabic, where morphology is encoded in…

Computation and Language · Computer Science 2025-06-04 Bar Gazit, Shaltiel Shmidman, Avi Shmidman, Yuval Pinter

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average…

Computation and Language · Computer Science 2026-05-12 Lin Zheng, Vasilisa Bashlovkina, Timothy Dozat, Dan Garrette, Laura Rimell, Joshua Maynez

BASS: Batched Attention-optimized Speculative Sampling

Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications…

Machine Learning · Computer Science 2024-06-27 Haifeng Qian, Sujan Kumar Gonugondla, Sungsoo Ha, Mingyue Shang, Sanjay Krishna Gouda, Ramesh Nallapati, Sudipta Sengupta, Xiaofei Ma, Anoop Deoras

How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis

Tokenization is fundamental in assembly code analysis, impacting intrinsic characteristics like vocabulary size, semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of…

Artificial Intelligence · Computer Science 2025-11-07 Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder

BanglaByT5: Byte-Level Modelling for Bangla

Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLM models use traditional tokenizers like BPE and SentencePiece, which fail to capture the finer nuances of a…

Computation and Language · Computer Science 2025-05-26 Pramit Bhattacharyya, Arnab Bhattacharya

The Foundations of Tokenization: Statistical and Computational Concerns

Tokenization - the practice of converting strings of characters from an alphabet into sequences of tokens over a vocabulary - is a critical step in the NLP pipeline. The use of token representations is widely credited with increased model…

Computation and Language · Computer Science 2025-04-04 Juan Luis Gastaldi, John Terilla, Luca Malagutti, Brian DuSell, Tim Vieira, Ryan Cotterell