Related papers: Data Mixture Optimization: A Multi-fidelity Multi-…

200 papers

Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no…

Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive…

Computation and Language · Computer Science · 2024-12-10 · Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi

Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget…

Computation and Language · Computer Science · 2026-04-21 · Zhuo Chen, Yuxuan Miao, Supryadi, Deyi Xiong

Large Language Models have become the de facto approach to sequence-to-sequence text generation tasks, but for specialized tasks/domains, a pretrained LLM lacks specific capabilities to produce accurate or well-formatted responses.…

Computation and Language · Computer Science · 2024-03-20 · Jiuhai Chen, Jonas Mueller

Pretraining data of large language models composes multiple domains (e.g., web texts, academic papers, codes), whose mixture proportions crucially impact the competence of outcome models. While existing endeavors rely on heuristics or…

Computation and Language · Computer Science · 2025-03-21 · Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, Xipeng Qiu

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While…

Computation and Language · Computer Science · 2023-08-24 · Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos

A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches directly on…

Machine Learning · Computer Science · 2026-05-07 · Jingwei Li, Xinran Gu, Jingzhao Zhang

Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on…

Computation and Language · Computer Science · 2026-02-20 · Bettina Messmer, Vinko Sabolčec, Martin Jaggi

Large language models (LLMs) are vulnerable when trained on datasets containing harmful content, which leads to potential jailbreaking attacks in two scenarios: the integration of harmful texts within crowdsourced data used for pre-training…

Cryptography and Security · Computer Science · 2024-06-03 · Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi

We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture via a Mixture of Data Experts (MDE). We use this approximation as a…

Machine Learning · Computer Science · 2025-02-25 · Lior Belenki, Alekh Agarwal, Tianze Shi, Kristina Toutanova

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the…

Machine Learning · Computer Science · 2025-04-09 · Junjie Oscar Yin, Alexander M. Rush

Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly…

Machine Learning · Computer Science · 2025-01-28 · Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

This paper addresses the challenges of efficiently fine-tuning large language models (LLMs) by exploring data efficiency and hyperparameter optimization. We investigate the minimum data required for effective fine-tuning and propose a novel…

Computation and Language · Computer Science · 2024-07-22 · Michael Oliver, Guan Wang

Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web…

Computation and Language · Computer Science · 2023-09-12 · Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker

We study the problem of fine-tuning a language model (LM) for a target task by optimally using the information from $n$ auxiliary tasks. This problem has broad applications in NLP, such as targeted instruction tuning and data selection in…

Computation and Language · Computer Science · 2025-06-03 · Dongyue Li, Ziniu Zhang, Lu Wang, Hongyang R. Zhang

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science · 2024-01-24 · Logan Engstrom, Axel Feldmann, Aleksander Madry

Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present…

Computation and Language · Computer Science · 2025-03-11 · Tristan Thrush, Christopher Potts, Tatsunori Hashimoto

The effectiveness of large language models (LLMs) is often hindered by duplicated data in their extensive pre-training datasets. Current approaches primarily focus on detecting and removing duplicates, which risks the loss of valuable…

Computation and Language · Computer Science · 2024-07-10 · Nan He, Weichen Xiong, Hanwen Liu, Yi Liao, Lei Ding, Kai Zhang, Guohua Tang, Xiao Han, Wei Yang

Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take VLM…

This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired…

Machine Learning · Computer Science · 2024-05-07 · Feiyang Kang, Hoang Anh Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia