Related papers: Batching BPE Tokenization Merges

200 papers

Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-07-01 Wanpeng Zhang , Yicheng Feng , Hao Luo , Yijiang Li , Zihao Yue , Sipeng Zheng , Zongqing Lu

For different language pairs, word-level neural machine translation (NMT) models with a fixed-size vocabulary suffer from the same problem of representing out-of-vocabulary (OOV) words. The common practice usually replaces all these rare or…

Computation and Language · Computer Science 2018-07-26 Yingting Wu , Hai Zhao

This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units. We view the subword segmentation of output sentences as a latent variable that should be marginalized out…

Computation and Language · Computer Science 2020-08-04 Xuanli He , Gholamreza Haffari , Mohammad Norouzi

Modeling genomic sequences faces two unsolved challenges: the information density varies widely across different regions, while there is no clearly defined minimum vocabulary unit. Relying on either four primitive bases or independently…

Genomics · Quantitative Biology 2025-11-20 Siyuan Li , Kai Yu , Anna Wang , Zicheng Liu , Chang Yu , Jingbo Zhou , Qirong Yang , Yucheng Guo , Xiaoming Zhang , Stan Z. Li

Subword tokenization is a common method for vocabulary building in Neural Machine Translation (NMT) models. However, increasingly complex tasks have revealed its disadvantages. First, a vocabulary cannot be modified once it is learned,…

Computation and Language · Computer Science 2024-08-13 Langlin Huang , Yang Feng

Transformers are widely applied to solve natural language understanding and computer vision tasks. While scaling up these architectures leads to improved performance, it often comes at the expense of much higher computational costs. In…

Computer Vision and Pattern Recognition · Computer Science 2022-02-25 Cedric Renggli , André Susano Pinto , Neil Houlsby , Basil Mustafa , Joan Puigcerver , Carlos Riquelme

Since traditional tokenizers are isolated from a downstream task and model, they cannot output an appropriate tokenization depending on the task and model, although recent studies imply that the appropriate tokenization improves the…

Computation and Language · Computer Science 2021-05-27 Tatsuya Hiraoka , Sho Takase , Kei Uchiumi , Atsushi Keyaki , Naoaki Okazaki

Traditionally, NLP performance improvement has been focused on improving models and increasing the number of model parameters. NLP vocabulary construction has remained focused on maximizing the number of words represented through subword…

Computation and Language · Computer Science 2023-04-26 Sandeep Mehta , Darpan Shah , Ravindra Kulkarni , Cornelia Caragea

We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed…

Computation and Language · Computer Science 2026-05-28 Craig W. Schmidt , Michael Krumdick , Adam Wiemerslage , Seth Ebner , Varshini Reddy , Yuval Pinter , Chris Tanner

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing…

Computation and Language · Computer Science 2025-04-15 Buu Phan , Brandon Amos , Itai Gat , Marton Havasi , Matthew Muckley , Karen Ullrich

Phones and their context-dependent variants have been the standard modeling units for conventional speech recognition systems, while characters and subwords have demonstrated their effectiveness for end-to-end recognition systems. We…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-23 Weiran Wang , Guangsen Wang , Aadyot Bhatnagar , Yingbo Zhou , Caiming Xiong , Richard Socher

Tokenizers act as a bridge between human language and the latent space of language models, influencing how language is represented in these models. Due to the immense popularity of English-Centric Large Language Models (LLMs), efforts are…

Computation and Language · Computer Science 2025-01-22 Menan Velayuthan , Kengatharaiyer Sarveswaran

Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets -- an unrealistic assumption, given that in practice tokenisers operate over fixed-size…

Computation and Language · Computer Science 2025-11-20 Violeta Kastreva , Philip Whittington , Dennis Komm , Tiago Pimentel

As large language models move toward million-token context windows, CPU tokenizers become a major slowdown because they process text one step at a time while powerful GPUs sit unused. We built a GPU-based byte-level BPE tokenizer that…

Computation and Language · Computer Science 2026-03-04 Venu Gopal Kadamba , Kanishkha Jaisankar

We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we…

Computation and Language · Computer Science 2025-11-27 Dong Dong , Weijie Su

Code generation is a longstanding challenge, aiming to generate a code snippet based on a natural language description. Usually, expensive text-code paired data is essential for training a code generation model. Recently, thanks to the…

Software Engineering · Computer Science 2022-06-15 Daoguang Zan , Bei Chen , Dejian Yang , Zeqi Lin , Minsu Kim , Bei Guan , Yongji Wang , Weizhu Chen , Jian-Guang Lou

Code-mixing is a phenomenon of mixing words and phrases from two or more languages in a single utterance of speech and text. Due to the high linguistic diversity, code-mixing presents several challenges in evaluating standard natural…

Computation and Language · Computer Science 2021-07-27 Ayush Garg , Sammed S Kagi , Vivek Srivastava , Mayank Singh

Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models…

Computation and Language · Computer Science 2025-08-12 Jonas F. Lotz , Hendra Setiawan , Stephan Peitz , Yova Kementchedjhieva

Effective training of today's large language models (LLMs) depends on large batches and long sequences for throughput and accuracy. To handle variable-length sequences on hardware accelerators, it is common practice to introduce padding…

Computation and Language · Computer Science 2022-10-07 Mario Michael Krell , Matej Kosec , Sergio P. Perez , Andrew Fitzgibbon

Widespread online communication in a modern multilingual world has provided opportunities to blend more than one language (aka code-mixed language) in a single utterance. This has resulted in a formidable challenge for the computational…

Computation and Language · Computer Science 2024-05-01 Kartik Kartik , Sanjana Soni , Anoop Kunchukuttan , Tanmoy Chakraborty , Md Shad Akhtar