Related papers: Multilingual Language Model Pretraining using Mach…

200 papers

English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as leading LLMs still underperform for non-English languages, likely due to a…

Computation and Language · Computer Science 2024-11-07 Jiayi Wang, Yao Lu, Maurice Weber, Max Ryabinin, Yihong Chen, Raphael Tang, Pontus Stenetorp

Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on…

Computation and Language · Computer Science 2026-02-20 Bettina Messmer, Vinko Sabolčec, Martin Jaggi

The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and…

Computation and Language · Computer Science 2024-11-01 Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf

Pre-training state-of-the-art large language models (LLMs) requires vast amounts of clean and diverse text data. While the open development of large high-quality English pre-training datasets has seen substantial recent progress, training…

Most vision-and-language pretraining research focuses on English tasks. However, the creation of multilingual multimodal evaluation datasets (e.g. Multi30K, xGQA, XVNLI, and MaRVL) poses a new challenge in finding high-quality training data…

Computation and Language · Computer Science 2022-10-25 Chen Qiu, Dan Oneata, Emanuele Bugliarello, Stella Frank, Desmond Elliott

Open-source large language models (LLMs) have gained significant strength across diverse fields. Nevertheless, the majority of studies primarily concentrate on English, with only limited exploration into the realm of multilingual abilities.…

Computation and Language · Computer Science 2024-02-20 Haoyu Wang, Shuo Wang, Yukun Yan, Xujia Wang, Zhiyu Yang, Yuzhuang Xu, Zhenghao Liu, Liner Yang, Ning Ding, Xu Han, Zhiyuan Liu, Maosong Sun

As large language models (LLMs) grow and develop, so do their data demands. This is especially true for multilingual LLMs, where the scarcity of high-quality and readily available data online has led to a multitude of synthetic dataset…

Computation and Language · Computer Science 2024-11-12 Sultan Alrashed, Dmitrii Khizbullin, David R. Pugh

Large language models (LLMs) demonstrate remarkable ability to comprehend, reason, and generate following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English,…

Pretrained language models (PLMs) display impressive performances and have captured the attention of the NLP community. Establishing best practices in pretraining has, therefore, become a major focus of NLP research, especially since…

Computation and Language · Computer Science 2024-10-08 Zihao Li, Shaoxiong Ji, Timothee Mickus, Vincent Segonne, Jörg Tiedemann

Recent work demonstrates the potential of multilingual pretraining for creating one model that can be used for various tasks in different languages. Previous work in multilingual pretraining has demonstrated that machine translation systems…

Computation and Language · Computer Science 2020-08-04 Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan

In this paper, we explore the utility of translationese as synthetic data created using machine translation for pre-training language models (LMs) for low-resource languages (LRLs). Our simple methodology consists of translating large…

Computation and Language · Computer Science 2025-07-08 Meet Doshi, Raj Dabre, Pushpak Bhattacharyya

While large language models (LLMs) exhibit state-of-the-art performance in various tasks, recent studies have revealed their struggle for code translation. This is because they haven't been extensively pre-trained with parallel multilingual…

Software Engineering · Computer Science 2024-10-15 Qingxiao Tao, Tingrui Yu, Xiaodong Gu, Beijun Shen

Large Language Models (LLMs) have demonstrated remarkable performance across various natural language tasks, marking significant strides towards general artificial intelligence. While general artificial intelligence is leveraged by…

Computation and Language · Computer Science 2023-10-31 Yizhe Yang, Huashan Sun, Jiawei Li, Runheng Liu, Yinghao Li, Yuhang Liu, Heyan Huang, Yang Gao

General Large Language Models (LLMs) excel in reasoning, but those enhanced for translation struggle with reasoning tasks. To address this, we propose a novel translation-enhanced recipe that begins with instruct models and applies…

Computation and Language · Computer Science 2025-10-13 Changjiang Gao, Zixian Huang, Jingyang Gong, Shujian Huang, Lei Li, Fei Yuan

High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly…

Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high…

Computation and Language · Computer Science 2025-02-07 Skyler Seto, Maartje ter Hoeve, Richard He Bai, Natalie Schluter, David Grangier

Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as…

Computation and Language · Computer Science 2025-12-17 Jonas Golde, Patrick Haller, Alan Akbik

Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English. This is unsurprising given that their training data largely consists of English text and instructions.…

Computation and Language · Computer Science 2024-07-02 Wenhao Zhu, Shujian Huang, Fei Yuan, Shuaijie She, Jiajun Chen, Alexandra Birch

Recent studies have demonstrated the efficiency of generative pretraining for English natural language understanding. In this work, we extend this approach to multiple languages and show the effectiveness of cross-lingual pretraining. We…

Computation and Language · Computer Science 2019-01-23 Guillaume Lample, Alexis Conneau

As the performance of Large-scale Vision Language Models (LVLMs) improves, they are increasingly capable of responding in multiple languages, and there is an expectation that the demand for explanations generated by LVLMs will grow.…

Computation and Language · Computer Science 2025-02-17 Shintaro Ozaki, Kazuki Hayashi, Yusuke Sakai, Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe