Related papers: Multilingual Language Model Pretraining using Mach…

Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language

English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as leading LLMs still underperform for non-English languages, likely due to a…

Computation and Language · Computer Science 2024-11-07 Jiayi Wang, Yao Lu, Maurice Weber, Max Ryabinin, Yihong Chen, Raphael Tang, Pontus Stenetorp

Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on…

Computation and Language · Computer Science 2026-02-20 Bettina Messmer, Vinko Sabolčec, Martin Jaggi

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and…

Computation and Language · Computer Science 2024-11-01 Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training…

Computation and Language · Computer Science 2025-06-27 Guilherme Penedo, Hynek Kydlíček, Vinko Sabolčec, Bettina Messmer, Negar Foroutan, Amir Hossein Kargaran, Colin Raffel, Martin Jaggi, Leandro Von Werra, Thomas Wolf

Multilingual Multimodal Learning with Machine Translated Text

Most vision-and-language pretraining research focuses on English tasks. However, the creation of multilingual multimodal evaluation datasets (e.g. Multi30K, xGQA, XVNLI, and MaRVL) poses a new challenge in finding high-quality training data…

Computation and Language · Computer Science 2022-10-25 Chen Qiu, Dan Oneata, Emanuele Bugliarello, Stella Frank, Desmond Elliott

UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset

Open-source large language models (LLMs) have gained significant strength across diverse fields. Nevertheless, the majority of studies primarily concentrate on English, with only limited exploration into the realm of multilingual abilities.…

Computation and Language · Computer Science 2024-02-20 Haoyu Wang, Shuo Wang, Yukun Yan, Xujia Wang, Zhiyu Yang, Yuzhuang Xu, Zhenghao Liu, Liner Yang, Ning Ding, Xu Han, Zhiyuan Liu, Maosong Sun

Fineweb-Edu-Ar: Machine-translated Corpus to Support Arabic Small Language Models

As large language models (LLMs) grow and develop, so do their data demands. This is especially true for multilingual LLMs, where the scarcity of high-quality and readily available data online has led to a multitude of synthetic dataset…

Computation and Language · Computer Science 2024-11-12 Sultan Alrashed, Dmitrii Khizbullin, David R. Pugh

PolyLM: An Open Source Polyglot Large Language Model

Large language models (LLMs) demonstrate remarkable ability to comprehend, reason, and generate following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English,…

Computation and Language · Computer Science 2023-07-13 Xiangpeng Wei, Haoran Wei, Huan Lin, Tianhao Li, Pei Zhang, Xingzhang Ren, Mei Li, Yu Wan, Zhiwei Cao, Binbin Xie, Tianxiang Hu, Shangjie Li, Binyuan Hui, Bowen Yu, Dayiheng Liu, Baosong Yang, Fei Huang, Jun Xie

A Comparison of Language Modeling and Translation as Multilingual Pretraining Objectives

Pretrained language models (PLMs) display impressive performances and have captured the attention of the NLP community. Establishing best practices in pretraining has, therefore, become a major focus of NLP research, especially since…

Computation and Language · Computer Science 2024-10-08 Zihao Li, Shaoxiong Ji, Timothee Mickus, Vincent Segonne, Jörg Tiedemann

Multilingual Translation with Extensible Multilingual Pretraining and Finetuning

Recent work demonstrates the potential of multilingual pretraining for creating one model that can be used for various tasks in different languages. Previous work in multilingual pretraining has demonstrated that machine translation systems…

Computation and Language · Computer Science 2020-08-04 Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan

Pretraining Language Models Using Translationese

In this paper, we explore the utility of translationese as synthetic data created using machine translation for pre-training language models (LMs) for low-resource languages (LRLs). Our simple methodology consists of translating large…

Computation and Language · Computer Science 2025-07-08 Meet Doshi, Raj Dabre, Pushpak Bhattacharyya

Unraveling the Potential of Large Language Models in Code Translation: How Far Are We?

While large language models (LLMs) exhibit state-of-the-art performance in various tasks, recent studies have revealed their struggle for code translation. This is because they haven't been extensively pre-trained with parallel multilingual…

Software Engineering · Computer Science 2024-10-15 Qingxiao Tao, Tingrui Yu, Xiaodong Gu, Beijun Shen

MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications

Large Language Models (LLMs) have demonstrated remarkable performance across various natural language tasks, marking significant strides towards general artificial intelligence. While general artificial intelligence is leveraged by…

Computation and Language · Computer Science 2023-10-31 Yizhe Yang, Huashan Sun, Jiawei Li, Runheng Liu, Yinghao Li, Yuhang Liu, Heyan Huang, Yang Gao

LLaMAX2: Your Translation-Enhanced Model also Performs Well in Reasoning

General Large Language Models (LLMs) excel in reasoning, but those enhanced for translation struggle with reasoning tasks. To address this, we propose a novel translation-enhanced recipe that begins with instruct models and applies…

Computation and Language · Computer Science 2025-10-13 Changjiang Gao, Zixian Huang, Jingyang Gong, Shujian Huang, Lei Li, Fei Yuan

Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly…

Computation and Language · Computer Science 2025-06-03 Mehdi Ali, Manuel Brack, Max Lübbering, Elias Wendt, Abbas Goher Khan, Richard Rutmann, Alex Jude, Maurice Kraus, Alexander Arno Weber, David Kaczér, Florian Mai, Lucie Flek, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Patrick Schramowski, Michael Fromm, Kristian Kersting

Training Bilingual LMs with Data Constraints in the Targeted Language

Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high…

Computation and Language · Computer Science 2025-02-07 Skyler Seto, Maartje ter Hoeve, Richard He Bai, Natalie Schluter, David Grangier

FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition

Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as…

Computation and Language · Computer Science 2025-12-17 Jonas Golde, Patrick Haller, Alan Akbik

Question Translation Training for Better Multilingual Reasoning

Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English. This is unsurprising given that their training data largely consists of English text and instructions.…

Computation and Language · Computer Science 2024-07-02 Wenhao Zhu, Shujian Huang, Fei Yuan, Shuaijie She, Jiajun Chen, Alexandra Birch

Cross-lingual Language Model Pretraining

Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We…

Computation and Language · Computer Science 2019-01-23 Guillaume Lample, Alexis Conneau

Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models

As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow.…

Computation and Language · Computer Science 2025-02-17 Shintaro Ozaki, Kazuki Hayashi, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe