Related papers: Batching BPE Tokenization Merges

200 papers

Tokenization is a critical component of Natural Language Processing (NLP), especially for low resource languages, where subword segmentation influences vocabulary structure and downstream task accuracy. Although Byte Pair Encoding (BPE) is…

Computation and Language · Computer Science 2025-04-25 Priyaranjan Pattnayak , Hitesh Laxmichand Patel , Amit Agarwal

Tokenization is widely used in large language models because it significantly improves performance. However, tokenization imposes several disadvantages, such as performance biases, increased adversarial vulnerability, decreased…

Computation and Language · Computer Science 2024-10-08 Kevin Slagle

Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be…

Computation and Language · Computer Science 2023-03-28 Cagri Toraman , Eyup Halit Yilmaz , Furkan Şahinuç , Oguzhan Ozcelik

Sentence embeddings are commonly used in text clustering and semantic retrieval tasks. State-of-the-art sentence representation methods are based on artificial neural networks fine-tuned on large collections of manually labeled sentence…

Computation and Language · Computer Science 2022-07-27 Sławomir Dadas

Pretrained multilingual encoder models can directly perform zero-shot multilingual tasks or linguistic probing by reformulating the input examples into cloze-style prompts. This is accomplished by predicting the probabilities of the label…

Computation and Language · Computer Science 2023-10-20 Ercong Nie , Helmut Schmid , Hinrich Schütze

We introduce a new balanced assignment of experts (BASE) layer for large language models that greatly simplifies existing high capacity sparse layers. Sparse layers can dramatically improve the efficiency of training and inference by…

Computation and Language · Computer Science 2021-04-01 Mike Lewis , Shruti Bhosale , Tim Dettmers , Naman Goyal , Luke Zettlemoyer

Scaling test-time compute via extended reasoning has become a key paradigm for improving the capabilities of large language models (LLMs). However, existing approaches optimize reasoning under fixed or uniformly sampled token budgets,…

Computation and Language · Computer Science 2026-04-23 Amirul Rahman , Aisha Karim , Kenji Nakamura , Yi-Fan Ng

We present a method to compress the final linear layer of language models, reducing memory usage by up to 3.4x without significant performance loss. By grouping tokens based on Byte Pair Encoding (BPE) merges, we prevent materialization of…

Computation and Language · Computer Science 2024-11-12 Sreeram Vennam , Anish Joishy , Ponnurangam Kumaraguru

Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and heuristic subword induction algorithms. A simple universal alternative is to represent every computerized text as a sequence of bytes…

Computation and Language · Computer Science 2021-04-13 Uri Shaham , Omer Levy

This study explores the tokenization of multitrack sheet music in ABC notation, introducing two methods--bar-stream and line-stream patching. We compare these methods against existing techniques, including bar patching, byte patching, and…

Sound · Computer Science 2024-10-24 Yashan Wang , Shangda Wu , Xingjian Du , Maosong Sun

Subword tokenization is a commonly used input pre-processing step in most recent NLP models. However, it limits the models' ability to leverage end-to-end task learning. Its frequency-based vocabulary creation compromises tokenization in…

Computation and Language · Computer Science 2022-04-25 Md Mofijul Islam , Gustavo Aguilar , Pragaash Ponnusamy , Clint Solomon Mathialagan , Chengyuan Ma , Chenlei Guo

Data representation remains a fundamental challenge in machine learning, particularly when adapting sequence-based architectures like Transformers and Large Language Models (LLMs) for structured tabular data. Existing methods often fail to…

Machine Learning · Computer Science 2025-08-05 Kayvan Karim , Hani Ragab Hassen , Hadj Batatia

Target encoding is an effective encoding technique of categorical variables and is often used in machine learning systems for processing tabular data sets with mixed numeric and categorical variables. Recently an enhanced version of this…

Machine Learning · Computer Science 2020-11-24 Michael Larionov

In this work, we investigate the positional encoding methods used in language pre-training (e.g., BERT) and identify several problems in the existing formulations. First, we show that in the absolute positional encoding, the addition…

Computation and Language · Computer Science 2021-03-16 Guolin Ke , Di He , Tie-Yan Liu

Despite recent advances in subquadratic attention mechanisms or state-space models, processing long token sequences still imposes significant computational requirements. Token merging has emerged as a solution to increase computational…

Machine Learning · Computer Science 2025-08-06 Leon Götz , Marcel Kollovieh , Stephan Günnemann , Leo Schwinn

Dense vector representations for textual data are crucial in modern NLP. Word embeddings and sentence embeddings estimated from raw texts are key in achieving state-of-the-art results in various tasks requiring semantic understanding.…

Computation and Language · Computer Science 2023-07-06 Sonal Sannigrahi , Josef van Genabith , Cristina Espana-Bonet

Batched sparse (BATS) codes were proposed as a reliable communication solution for networks with packet loss. In the finite-length regime, the error probability of BATS codes under belief propagation (BP) decoding has been studied in the…

Information Theory · Computer Science 2025-02-12 Mingyang Zhu , Shenghao Yang , Ming Jiang , Chunming Zhao

As neural machine translation (NMT) is not easily amenable to explicit correction of errors, incorporating pre-specified translations into NMT is widely regarded as a non-trivial challenge. In this paper, we propose and explore three…

Computation and Language · Computer Science 2019-12-03 Tao Wang , Shaohui Kuang , Deyi Xiong , António Branco

Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. Existing token merging methods reduce sequence length but frequently…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Mouxiao Huang , Borui Jiang , Dehua Zheng , Hailin Hu , Kai Han , Xinghao Chen

Batch normalization (BN) is a popular and ubiquitous method in deep learning that has been shown to decrease training time and improve generalization performance of neural networks. Despite its success, BN is not theoretically well…

Machine Learning · Computer Science 2022-01-21 Susanna Lange , Kyle Helfrich , Qiang Ye