Related papers: Efficient Transformers with Dynamic Token Pooling

Retrofitting Large Language Models with Dynamic Tokenization

Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the…

Computation and Language · Computer Science 2025-06-12 Darius Feher, Ivan Vulić, Benjamin Minixhofer

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

Pool Me Wisely: On the Effect of Pooling in Transformer-Based Models

Transformer models have become the dominant backbone for sequence modeling, leveraging self-attention to produce contextualized token representations. These are typically aggregated into fixed-size vectors via pooling operations for…

Machine Learning · Computer Science 2025-10-07 Sofiane Ennadir, Levente Zólyomi, Oleg Smirnov, Tianze Wang, John Pertoft, Filip Cornell, Lele Cao

Contextual Morphogenesis in Large Language Models: A Novel Approach to Self-Organizing Token Representations

Token representations influence the efficiency and adaptability of language models, yet conventional tokenization strategies impose rigid segmentation boundaries that do not adjust dynamically to evolving contextual relationships. The…

Computation and Language · Computer Science 2025-08-11 Alistair Dombrowski, Beatrix Engelhardt, Dimitri Fairbrother, Henry Evidail

Fine-Tuning Transformers: Vocabulary Transfer

Transformers are responsible for the vast majority of recent advances in natural language processing. The majority of practical natural language processing applications of these models are typically enabled through transfer learning. This…

Computation and Language · Computer Science 2024-02-02 Vladislav Mosin, Igor Samenko, Alexey Tikhonov, Borislav Kozlovskii, Ivan P. Yamshchikov

Learning to Merge Tokens in Vision Transformers

Transformers are widely applied to solve natural language understanding and computer vision tasks. While scaling up these architectures leads to improved performance, it often comes at the expense of much higher computational costs. In…

Computer Vision and Pattern Recognition · Computer Science 2022-02-25 Cedric Renggli, André Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, Carlos Riquelme

FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to…

Computation and Language · Computer Science 2026-05-14 Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar

Compound Word Transformer: Learning to Compose Full-Song Music over Dynamic Directed Hypergraphs

To apply neural sequence models such as the Transformers to music generation tasks, one has to represent a piece of music by a sequence of tokens drawn from a finite set of pre-defined vocabulary. Such a vocabulary usually involves tokens…

Sound · Computer Science 2021-01-08 Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh, Yi-Hsuan Yang

Lexical Manifold Reconfiguration in Large Language Models: A Novel Architectural Approach for Contextual Modulation

Contextual adaptation in token embeddings plays a central role in determining how well language models maintain coherence and retain semantic relationships over extended text sequences. Static embeddings often impose constraints on lexical…

Computation and Language · Computer Science 2025-03-27 Koinis Vassilis, Godfrey Milbourne, Harriet Featherstone, Xanthe Peverell, Yorick Bletchley, Zachary Montford

Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

In Transformer architectures, tokens (discrete units derived from raw data) are formed by segmenting inputs into fixed-length chunks. Each token is then mapped to an embedding, enabling parallel attention computations…

Machine Learning · Computer Science 2026-01-14 Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik

ByteSpan: Information-Driven Subword Tokenisation

Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an…

Computation and Language · Computer Science 2025-06-24 Zébulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery

Generation with Dynamic Vocabulary

We introduce a new dynamic vocabulary for language models. It can involve arbitrary text spans during generation. These text spans act as basic generation bricks, akin to tokens in the traditional static vocabularies. We show that the…

Computation and Language · Computer Science 2024-10-14 Yanting Liu, Tao Ji, Changzhi Sun, Yuanbin Wu, Xiaoling Wang

Sparsifying Transformer Models with Trainable Representation Pooling

We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process, thus focusing on the task-specific parts of an input. A reduction of…

Computation and Language · Computer Science 2022-03-08 Michał Pietruszka, Łukasz Borchmann, Łukasz Garncarek

Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengths that far exceed the underlying semantic content, incurring prohibitive inference costs.…

Computation and Language · Computer Science 2026-04-09 Bajian Xiang, Tingwei Guo, Xuan Chen, Yang Han

Vocabulary Customization for Efficient Domain-Specific LLM Deployment

When using an LLM to process text outside the training domain(s), an often overlooked factor is vocabulary mismatch, where the general-domain tokenizer fails to capture frequent domain-specific terms, leading to higher token fertility and…

Computation and Language · Computer Science 2025-10-01 Christian Herold, Michael Kozielski, Nicholas Santavas, Yannick Versley, Shahram Khadivi

Efficient Representation Learning via Adaptive Context Pooling

Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which…

Machine Learning · Computer Science 2022-07-06 Chen Huang, Walter Talbott, Navdeep Jaitly, Josh Susskind

MANTa: Efficient Gradient-Based Tokenization for Robust End-to-End Language Modeling

Static subword tokenization algorithms have been an essential component of recent works on language modeling. However, their static nature results in important flaws that degrade the models' downstream performance and robustness. In this…

Computation and Language · Computer Science 2022-12-15 Nathan Godey, Roman Castagné, Éric de la Clergerie, Benoît Sagot

Dynamic Evaluation of Transformer Language Models

This research note combines two methods that have recently improved the state of the art in language modeling: Transformers and dynamic evaluation. Transformers use stacked layers of self-attention that allow them to capture long range…

Machine Learning · Computer Science 2019-04-18 Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals

PoNet: Pooling Network for Efficient Token Mixing in Long Sequences

Transformer-based models have achieved great success in various NLP, vision, and speech tasks. However, the core of Transformer, the self-attention mechanism, has a quadratic time and memory complexity with respect to the sequence length,…

Computation and Language · Computer Science 2023-05-23 Chao-Hong Tan, Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Zhen-Hua Ling

Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples…

Computation and Language · Computer Science 2025-05-26 Hongzhi Huang, Defa Zhu, Banggu Wu, Yutao Zeng, Ya Wang, Qiyang Min, Xun Zhou