Related papers: Batching BPE Tokenization Merges

200 papers

Decreasing sequence length is a common way to accelerate transformers, but prior token reduction work often targets classification and reports proxy metrics rather than end-to-end latency. For semantic segmentation, token reduction is…

Computer Vision and Pattern Recognition · Computer Science 2026-04-08 Simon Ravé, Pejman Rasti, David Rousseau

Deciphering historical substitution ciphers is a challenging problem. Example problems that have been previously studied include detecting cipher type, detecting plaintext language, and acquiring the substitution key for segmented ciphers.…

Computation and Language · Computer Science 2022-05-26 Nada Aldarrab, Jonathan May

We present two end-to-end models: Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words or words as the unit of choice to model text. These…

Audio and Speech Processing · Electrical Eng. & Systems 2018-11-27 Bo Li, Yu Zhang, Tara Sainath, Yonghui Wu, William Chan

Batch normalization (BN) is a fundamental unit in modern deep networks, in which a linear transformation module was designed for improving BN's flexibility of fitting complex data distributions. In this paper, we demonstrate properly…

Computer Vision and Pattern Recognition · Computer Science 2020-12-01 Yuhui Xu, Lingxi Xie, Cihang Xie, Jieru Mei, Siyuan Qiao, Wei Shen, Hongkai Xiong, Alan Yuille

State-of-the-art approaches for clustering high-dimensional data utilize deep auto-encoder architectures. Many of these networks require a large number of parameters and suffer from a lack of interpretability, due to the black-box nature of…

Machine Learning · Computer Science 2022-02-28 Alexander Lin, Andrew H. Song, Demba Ba

A major consideration in multilingual language modeling is how to best represent languages with diverse vocabularies and scripts. Although contemporary text encoding methods cover most of the world's writing systems, they exhibit bias…

Computation and Language · Computer Science 2024-11-12 Tomasz Limisiewicz, Terra Blevins, Hila Gonen, Orevaoghene Ahia, Luke Zettlemoyer

Neural network quantization is frequently used to optimize model size, latency and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer get…

Machine Learning · Computer Science 2023-02-13 Nilesh Prasad Pandey, Markus Nagel, Mart van Baalen, Yin Huang, Chirag Patel, Tijmen Blankevoort

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive…

The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are…

Computation and Language · Computer Science 2021-09-14 Antonis Maronikolakis, Philipp Dufter, Hinrich Schütze

As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic…

Computation and Language · Computer Science 2025-07-11 Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu

Subword tokenization algorithms used by Large Language Models are significantly more efficient and can independently build the necessary vocabulary of words and subwords without human intervention. However, those subwords do not always…

Computation and Language · Computer Science 2024-10-04 Óscar García-Sierra, Ana Fernández-Pampillón Cesteros, Miguel Ortega-Martín

Recent end-to-end automatic speech recognition (ASR) systems often utilize a Transformer-based acoustic encoder that generates embedding at a high frame rate. However, this design is inefficient, particularly for long speech signals due to…

Computation and Language · Computer Science 2023-06-29 Yuang Li, Yu Wu, Jinyu Li, Shujie Liu

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can…

Computation and Language · Computer Science 2022-03-09 Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the…

Computer Vision and Pattern Recognition · Computer Science 2025-10-24 Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen

Common subword tokenization algorithms like BPE and UnigramLM assume that text can be split into meaningful units by concatenative measures alone. This is not true for languages such as Hebrew and Arabic, where morphology is encoded in…

Computation and Language · Computer Science 2025-06-04 Bar Gazit, Shaltiel Shmidman, Avi Shmidman, Yuval Pinter

Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average…

Computation and Language · Computer Science 2026-05-12 Lin Zheng, Vasilisa Bashlovkina, Timothy Dozat, Dan Garrette, Laura Rimell, Joshua Maynez

Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications…

Tokenization is fundamental in assembly code analysis, impacting intrinsic characteristics like vocabulary size, semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of…

Artificial Intelligence · Computer Science 2025-11-07 Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder

Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLM models use traditional tokenizers like BPE and SentencePiece, which fail to capture the finer nuances of a…

Computation and Language · Computer Science 2025-05-26 Pramit Bhattacharyya, Arnab Bhattacharya

Tokenization - the practice of converting strings of characters from an alphabet into sequences of tokens over a vocabulary - is a critical step in the NLP pipeline. The use of token representations is widely credited with increased model…

Computation and Language · Computer Science 2025-04-04 Juan Luis Gastaldi, John Terilla, Luca Malagutti, Brian DuSell, Tim Vieira, Ryan Cotterell