Related papers: Zero-Shot Tokenizer Transfer

200 papers

Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the…

Computation and Language · Computer Science 2025-06-12 Darius Feher, Ivan Vulić, Benjamin Minixhofer

Resource limitations often constrain the parameter counts of Large Language Models (LLMs), hindering their performance. While existing methods employ parameter sharing to reuse the same parameter set under fixed budgets, such approaches…

Computation and Language · Computer Science 2025-02-19 Guanghao Li, Wenhao Jiang, Li Shen, Ming Tang, Chun Yuan

Large Language Models (LLMs) are trained to support an increasing number of languages, yet their predefined tokenizers remain a bottleneck for adapting models to lower-resource or distinct-script languages. Existing tokenizer transfer…

Computation and Language · Computer Science 2026-05-12 Mykola Haltiuk, Aleksander Smywinski-Pohl

Recent large language models (LLM) exhibit sub-optimal performance on low-resource languages, as the training data of these models is usually dominated by English and other high-resource languages. Furthermore, it is challenging to train…

Computation and Language · Computer Science 2023-12-18 Zoltan Csaki, Pian Pawakapan, Urmish Thakker, Qiantong Xu

The development of monolingual language models for low and mid-resource languages continues to be hindered by the difficulty in sourcing high-quality training data. In this study, we present a novel cross-lingual vocabulary transfer…

Computation and Language · Computer Science 2024-08-09 François Remy, Pieter Delobelle, Hayastan Avetisyan, Alfiya Khabibullina, Miryam de Lhoneux, Thomas Demeester

Transformer-based pre-trained language models (PLMs) have achieved remarkable performance in various natural language processing (NLP) tasks. However, pre-training such models can take considerable resources that are almost only available…

Computation and Language · Computer Science 2024-05-21 Haotian Ye, Yihong Liu, Chunlan Ma, Hinrich Schütze

Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents…

Computation and Language · Computer Science 2025-05-16 Shaurya Sharthak, Vinayak Pahalwan, Adithya Kamath, Adarsh Shirawalmath

Large pretrained language models (LMs) have become the central building block of many NLP applications. Training these models requires ever more computational resources and most of the existing models are trained on English text only. It is…

Computation and Language · Computer Science 2022-09-13 Benjamin Minixhofer, Fabian Paischer, Navid Rekabsaz

Training monolingual language models for low and mid-resource languages is made challenging by limited and often inadequate pretraining data. In this study, we propose a novel model conversion strategy to address this issue, adapting…

Computation and Language · Computer Science 2023-10-06 François Remy, Pieter Delobelle, Bettina Berendt, Kris Demuynck, Thomas Demeester

Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable…

Interpreting hierarchical structures latent in language is a key limitation of current language models (LMs). While previous research has implicitly leveraged these hierarchies to enhance LMs, approaches for their explicit encoding are yet…

Computation and Language · Computer Science 2024-11-22 Yuan He, Zhangdie Yuan, Jiaoyan Chen, Ian Horrocks

This paper proposes a technique for adding a new source or target language to an existing multilingual NMT model without re-training it on the initial set of languages. It consists in replacing the shared vocabulary with a small…

Computation and Language · Computer Science 2021-10-22 Alexandre Berard

This paper investigates the problem of learning cross-lingual representations in a contextual space. We propose Cross-Lingual BERT Transformation (CLBT), a simple and efficient approach to generate cross-lingual contextualized word…

Computation and Language · Computer Science 2019-09-17 Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, Ting Liu

Recent work in cross-lingual semantic parsing has successfully applied machine translation to localize parsers to new languages. However, these advances assume access to high-quality machine translation systems and word alignment tools. We…

Computation and Language · Computer Science 2022-03-08 Tom Sherborne, Mirella Lapata

In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often…

Software Engineering · Computer Science 2025-11-03 Manojit Chakraborty, Madhusudan Ghosh, Rishabh Gupta

Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and…

Computation and Language · Computer Science 2025-01-08 Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach

Tokenization is associated with many poorly understood shortcomings in language models (LMs), yet remains an important component for long sequence scaling purposes. This work studies how tokenization impacts model performance by analyzing…

Computation and Language · Computer Science 2025-04-15 Buu Phan, Brandon Amos, Itai Gat, Marton Havasi, Matthew Muckley, Karen Ullrich

We present a training-free method to transplant tokenizers in pretrained large language models (LLMs) by reconstructing unseen token embeddings via Orthogonal Matching Pursuit (OMP). Specifically, we approximate each out-of-vocabulary token…

Computation and Language · Computer Science 2025-06-10 Charles Goddard, Fernando Fernandes Neto

Massively multilingual transformers pretrained with language modeling objectives (e.g., mBERT, XLM-R) have become a de facto default transfer paradigm for zero-shot cross-lingual transfer in NLP, offering unmatched transfer performance.…

Computation and Language · Computer Science 2020-05-05 Anne Lauscher, Vinit Ravishankar, Ivan Vulić, Goran Glavaš

Multilingual pre-trained contextual embedding models (Devlin et al., 2019) have achieved impressive performance on zero-shot cross-lingual transfer tasks. Finding the most effective fine-tuning strategy to fine-tune these models on…

Computation and Language · Computer Science 2021-07-22 Weijia Xu, Batool Haider, Jason Krone, Saab Mansour