Related papers: Zero-Shot Tokenizer Transfer

Retrofitting Large Language Models with Dynamic Tokenization

Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the…

Computation and Language · Computer Science 2025-06-12 Darius Feher, Ivan Vulić, Benjamin Minixhofer

Zero Token-Driven Deep Thinking in LLMs: Unlocking the Full Potential of Existing Parameters via Cyclic Refinement

Resource limitations often constrain the parameter counts of Large Language Models (LLMs), hindering their performance. While existing methods employ parameter sharing to reuse the same parameter set under fixed budgets, such approaches…

Computation and Language · Computer Science 2025-02-19 Guanghao Li, Wenhao Jiang, Li Shen, Ming Tang, Chun Yuan

Model-Aware Tokenizer Transfer

Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer…

Computation and Language · Computer Science 2026-05-12 Mykola Haltiuk, Aleksander Smywinski-Pohl

Efficiently Adapting Pretrained Language Models To New Languages

Recent large language models (LLM) exhibit sub-optimal performance on low-resource languages, as the training data of these models is usually dominated by English and other high-resource languages. Furthermore, it is challenging to train…

Computation and Language · Computer Science 2023-12-18 Zoltan Csaki, Pian Pawakapan, Urmish Thakker, Qiantong Xu

Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource NLP

The development of monolingual language models for low and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer…

Computation and Language · Computer Science 2024-08-09 François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester

MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer

Transformer-based pre-trained language models (PLMs) have achieved remarkable performance in various natural language processing (NLP) tasks. However, pre-training such models can take considerable resources that are almost only available…

Computation and Language · Computer Science 2024-05-21 Haotian Ye, Yihong Liu, Chunlan Ma, Hinrich Schütze

Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents…

Computation and Language · Computer Science 2025-05-16 Shaurya Sharthak, Vinayak Pahalwan, Adithya Kamath, Adarsh Shirawalmath

WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models

Large pretrained language models (LMs) have become the central building block of many NLP applications. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is…

Computation and Language · Computer Science 2022-09-13 Benjamin Minixhofer, Fabian Paischer, Navid Rekabsaz

Tik-to-Tok: Translating Language Models One Token at a Time: An Embedding Initialization Strategy for Efficient Language Adaptation

Training monolingual language models for low and mid-resource languages is made challenging by limited and often inadequate pretraining data. In this study, we propose a novel model conversion strategy to address this issue, adapting…

Computation and Language · Computer Science 2023-10-06 François Remy, Pieter Delobelle, Bettina Berendt, Kris Demuynck, Thomas Demeester

A Family of LLMs Liberated from Static Vocabularies

Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable…

Computation and Language · Computer Science 2026-03-18 Aleph Alpha: Adnen Abdessaied, Artur Baranowski, Lukas Balles, Michael Barlow, Fabien C. Y. Benureau, Felix Berkenkamp, Lukas Bluebaum, Bastian Boll, Thomas F. Burns, Björn Deiseroth, Constantin Eichenberg, David Friede, Pablo Iyu Guerrero, Ahmed Hammam, Bastian Harren, Johann Higl, Yasser Jadidi, Carina Kauf, Johannes Messner, Jan Hendrik Metzen, Max Meuer, Vedant Nanda, Pit Neitemeier, Koen Oostermeijer, Letitia Parcalabescu, Markus Pernpointner, Felix Reinfurt, Dylan Rodriquez, Grégory Schott, Philipp Siedler, Martin Simonovsky, Till Speicher, Volker Stampa, Stephan Wäldchen, Samuel Weinbach, Gregor Ziegltrum

Language Models as Hierarchy Encoders

Interpreting hierarchical structures latent in language is a key limitation of current language models (LMs). While previous research has implicitly leveraged these hierarchies to enhance LMs, approaches for their explicit encoding are yet…

Computation and Language · Computer Science 2024-11-22 Yuan He, Zhangdie Yuan, Jiaoyan Chen, Ian Horrocks

Continual Learning in Multilingual NMT via Language-Specific Embeddings

This paper proposes a technique for adding a new source or target language to an existing multilingual NMT model without re-training it on the initial set of languages. It consists in replacing the shared vocabulary with a small…

Computation and Language · Computer Science 2021-10-22 Alexandre Berard

Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing

This paper investigates the problem of learning cross-lingual representations in a contextual space. We propose Cross-Lingual BERT Transformation (CLBT), a simple and efficient approach to generate cross-lingual contextualized word…

Computation and Language · Computer Science 2019-09-17 Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, Ting Liu

Zero-Shot Cross-lingual Semantic Parsing

Recent work in cross-lingual semantic parsing has successfully applied machine translation to localize parsers to new languages. However, these advances assume access to high-quality machine translation systems and word alignment tools. We…

Computation and Language · Computer Science 2022-03-08 Tom Sherborne, Mirella Lapata

LLM Based Long Code Translation using Identifier Replacement

In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often…

Software Engineering · Computer Science 2025-11-03 Manojit Chakraborty, Madhusudan Ghosh, Rishabh Gupta

T-FREE: Subword Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and…

Computation and Language · Computer Science 2025-01-08 Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach

Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing…

Computation and Language · Computer Science 2025-04-15 Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich

Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit

We present a training-free method to transplant tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token…

Computation and Language · Computer Science 2025-06-10 Charles Goddard, Fernando Fernandes Neto

From Zero to Hero: On the Limitations of Zero-Shot Cross-Lingual Transfer with Multilingual Transformers

Massively multilingual transformers pretrained with language modeling objectives (e.g., mBERT, XLM-R) have become a de facto default transfer paradigm for zero-shot cross-lingual transfer in NLP, offering unmatched transfer performance.…

Computation and Language · Computer Science 2020-05-05 Anne Lauscher, Vinit Ravishankar, Ivan Vulić, Goran Glavaš

Soft Layer Selection with Meta-Learning for Zero-Shot Cross-Lingual Transfer

Multilingual pre-trained contextual embedding models (Devlin et al., 2019) have achieved impressive performance on zero-shot cross-lingual transfer tasks. Finding the most effective fine-tuning strategy to fine-tune these models on…

Computation and Language · Computer Science 2021-07-22 Weijia Xu, Batool Haider, Jason Krone, Saab Mansour