Related papers: Batching BPE Tokenization Merges

200 papers

The relationship between tokenizer algorithm (e.g., Byte-Pair Encoding (BPE), Unigram), morphological alignment, tokenization quality (e.g., compression efficiency), and downstream performance remains largely unclear, particularly for…

Computation and Language · Computer Science 2025-11-11 Saketh Reddy Vemula, Sandipan Dandapat, Dipti Misra Sharma, Parameswari Krishnamurthy

In recent years, language models have become increasingly larger and more complex. However, the input representations for these models continue to rely on simple and greedy subword tokenization methods. In this paper, we propose a novel…

Computation and Language · Computer Science 2023-06-14 David Samuel, Lilja Øvrelid

This paper presents a comprehensive examination of the impact of tokenization strategies and vocabulary sizes on the performance of Arabic language models in downstream natural language processing tasks. Our investigation focused on the…

Computation and Language · Computer Science 2024-09-23 Mohamed Taher Alrefaie, Nour Eldin Morsy, Nada Samir

Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words…

Computation and Language · Computer Science 2025-05-23 Sudhansu Bala Das, Samujjal Choudhury, Tapas Kumar Mishra, Bidyut Kr. Patra

Tokenization is the first - and often underappreciated - layer of computation in language models. While Chain-of-Thought (CoT) prompting enables transformer models to approximate recurrent computation by externalizing intermediate steps, we…

Computation and Language · Computer Science 2025-05-21 Xiang Zhang, Juntai Cao, Jiaqi Wei, Yiwei Xu, Chenyu You

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

The next-coordinate prediction paradigm has emerged as the de facto standard in current auto-regressive mesh generation methods. Despite their effectiveness, there is no efficient measurement for the various tokenizers that serialize meshes…

Graphics · Computer Science 2025-05-21 Jian Liu, Haohan Weng, Biwen Lei, Xianghui Yang, Zibo Zhao, Zhuo Chen, Song Guo, Tao Han, Chunchao Guo

Training data memorization in NLP can both be beneficial (e.g., closed-book QA) and undesirable (personal data extraction). In any case, successful model training requires a non-trivial amount of memorization to store word spellings,…

Computation and Language · Computer Science 2021-12-03 Eugene Kharitonov, Marco Baroni, Dieuwke Hupkes

Tokenization is a fundamental component of language models for code. It involves breaking down the input into units that are later passed to the language model stack to learn high-dimensional representations used in various contexts, from…

Software Engineering · Computer Science 2025-07-22 Mootez Saad, Hao Li, Tushar Sharma, Ahmed E. Hassan

Building Machine Translation (MT) systems for low-resource languages remains challenging. For many language pairs, parallel data are not widely available, and in such cases MT models do not achieve results comparable to those seen with…

Computation and Language · Computer Science 2020-12-01 Alberto Poncelas, Jan Buts, James Hadley, Andy Way

The pre-trained language models have achieved great successes in various natural language understanding (NLU) tasks due to its capacity to capture the deep contextualized information in text by pre-training on large-scale corpora. One of…

Computation and Language · Computer Science 2021-06-04 Junqiu Wei, Qun Liu, Yinpeng Guo, Xin Jiang

Despite it being the cornerstone of BPE, the most common tokenization algorithm, the importance of compression in the tokenization process is still unclear. In this paper, we argue for the theoretical importance of compression, that can be…

Computation and Language · Computer Science 2024-06-25 Omer Goldman, Avi Caciularu, Matan Eyal, Kris Cao, Idan Szpektor, Reut Tsarfaty

The Bag-of-Words (BoW) representation is well applied to recent state-of-the-art image retrieval works. Typically, multiple vocabularies are generated to correct quantization artifacts and improve recall. However, this routine is corrupted…

Computer Vision and Pattern Recognition · Computer Science 2014-04-15 Liang Zheng, Shengjin Wang, Wengang Zhou, Qi Tian

Batched sparse (BATS) code is a promising technology for reliable data transmission in multi-hop wireless networks. As a BATS code consists of an outer code and an inner code that typically is a random linear network code, one main research…

Information Theory · Computer Science 2017-09-05 Zhiheng Zhou, Congduan Li, Shenghao Yang, Xuan Guang

The Unigram tokenization algorithm offers a probabilistic alternative to the greedy heuristics of Byte-Pair Encoding. Despite its theoretical elegance, its implementation in practice is complex, limiting its adoption to the SentencePiece…

Computation and Language · Computer Science 2026-04-13 Sander Land, Yuval Pinter

Large Language Models (LLMs) have demonstrated exceptional versatility across domains, including applications to electrocardiograms (ECGs). A growing body of work focuses on generating text from multi-channeled ECG signals and corresponding…

Computation and Language · Computer Science 2025-07-31 William Han, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao

With the widespread application of Large Language Models (LLMs) in the field of Natural Language Processing (NLP), enhancing their performance has become a research hotspot. This paper presents a novel multi-prompt ensemble decoding…

Computation and Language · Computer Science 2024-12-25 Jiaxin Guo, Daimeng Wei, Yuanchang Luo, Shimin Tao, Hengchao Shang, Zongyao Li, Shaojun Li, Jinlong Yang, Zhanglin Wu, Zhiqiang Rao, Hao Yang

Regular expression is important for many natural language processing tasks especially when used to deal with unstructured and semi-structured data. This work focuses on automatically generating regular expressions and proposes a novel…

Neural and Evolutionary Computing · Computer Science 2020-06-25 Desheng Wang, Jiawei Liu, Xiang Qi, Baolin Sun, Peng Zhang

Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea…

Machine Learning · Computer Science 2026-05-26 Jiale Fu, Yuchu Jiang, Peijun Wu, Chonghan Liu, Joey Tianyi Zhou, Xu Yang

Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we propose efficient algorithms for the WordPiece tokenization used in BERT, from single-word tokenization to general text (e.g., sentence)…

Computation and Language · Computer Science 2021-10-07 Xinying Song, Alex Salcianu, Yang Song, Dave Dopson, Denny Zhou