Multilingual Language Model Pretraining using Machine-translated Data

Jiayi Wang; Yao Lu; Maurice Weber; Max Ryabinin; David Adelani; Yihong Chen; Raphael Tang; Pontus Stenetorp

Computation and Language 2025-02-20 v1

Abstract

High-resource languages such as English enable the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as LLMs still underperform on non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated texts from a single high-quality source language can contribute significantly to the pretraining quality of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this dataset. Across nine non-English reasoning tasks, we show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. We demonstrate that adding less than 5% of TransWebEdu as domain-specific pretraining data sets a new state of the art in Arabic, Italian, Indonesian, Swahili, and Welsh understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.

Keywords

cross-lingual transfer; instruction tuning; pre-trained language model

Cite

@article{arxiv.2502.13252,
  title  = {Multilingual Language Model Pretraining using Machine-translated Data},
  author = {Jiayi Wang and Yao Lu and Maurice Weber and Max Ryabinin and David Adelani and Yihong Chen and Raphael Tang and Pontus Stenetorp},
  journal= {arXiv preprint arXiv:2502.13252},
  year   = {2025}
}
