Related papers: Data Mixture Optimization: A Multi-fidelity Multi-…

ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

Determining the optimal data mixture for large language model training remains a challenging problem with an outsized impact on performance. In practice, language model developers continue to rely on heuristic exploration since no…

Machine Learning · Statistics 2025-08-19 Shengzhuang Chen, Xu Ouyang, Michael Arthur Leopold Pearce, Thomas Hartvigsen, Jonathan Richard Schwarz

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging

Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive…

Computation and Language · Computer Science 2024-12-10 Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget…

Computation and Language · Computer Science 2026-04-21 Zhuo Chen, Yuxuan Miao, Supryadi, Deyi Xiong

Automated Data Curation for Robust Language Model Fine-Tuning

Large Language Models have become the de facto approach to sequence-to-sequence text generation tasks, but for specialized tasks/domains, a pretrained LLM lacks specific capabilities to produce accurate or well-formatted responses.…

Computation and Language · Computer Science 2024-03-20 Jiuhai Chen, Jonas Mueller

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Pretraining data of large language models composes multiple domains (e.g., web texts, academic papers, codes), whose mixture proportions crucially impact the competence of outcome models. While existing endeavors rely on heuristics or…

Computation and Language · Computer Science 2025-03-21 Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, Xipeng Qiu

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While…

Computation and Language · Computer Science 2023-08-24 Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos

Capacity-Aware Mixture Law Enables Efficient LLM Data Optimization

A data mixture refers to how different data sources are combined to train large language models, and selecting an effective mixture is crucial for optimal downstream performance. Existing methods either conduct costly searches directly on…

Machine Learning · Computer Science 2026-05-07 Jingwei Li, Xinran Gu, Jingzhao Zhang

Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on…

Computation and Language · Computer Science 2026-02-20 Bettina Messmer, Vinko Sabolčec, Martin Jaggi

Robustifying Safety-Aligned Large Language Models through Clean Data Curation

Large language models (LLMs) are vulnerable when trained on datasets containing harmful content, which leads to potential jailbreaking attacks in two scenarios: the integration of harmful texts within crowdsourced data used for pre-training…

Cryptography and Security · Computer Science 2024-06-03 Xiaoqun Liu, Jiacheng Liang, Muchao Ye, Zhaohan Xi

Optimizing Pre-Training Data Mixtures with Mixtures of Data Expert Models

We propose a method to optimize language model pre-training data mixtures through efficient approximation of the cross-entropy loss corresponding to each candidate mixture via a Mixture of Data Experts (MDE). We use this approximation as a…

Machine Learning · Computer Science 2025-02-25 Lior Belenki, Alekh Agarwal, Tianze Shi, Kristina Toutanova

Compute-Constrained Data Selection

Data selection can reduce the amount of training data needed to finetune LLMs; however, the efficacy of data selection scales directly with its compute. Motivated by the practical challenge of compute-constrained finetuning, we consider the…

Machine Learning · Computer Science 2025-04-09 Junjie Oscar Yin, Alexander M. Rush

BiMix: A Bivariate Data Mixing Law for Language Model Pretraining

Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly…

Machine Learning · Computer Science 2025-01-28 Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

Crafting Efficient Fine-Tuning Strategies for Large Language Models

This paper addresses the challenges of efficiently fine-tuning large language models (LLMs) by exploring data efficiency and hyperparameter optimization. We investigate the minimum data required for effective fine-tuning and propose a novel…

Computation and Language · Computer Science 2024-07-22 Michael Oliver, Guan Wang

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web…

Computation and Language · Computer Science 2023-09-12 Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker

Scalable Fine-tuning from Multiple Data Sources: A First-Order Approximation Approach

We study the problem of fine-tuning a language model (LM) for a target task by optimally using the information from $n$ auxiliary tasks. This problem has broad applications in NLP, such as targeted instruction tuning and data selection in…

Computation and Language · Computer Science 2025-06-03 Dongyue Li, Ziniu Zhang, Lu Wang, Hongyang R. Zhang

DsDm: Model-Aware Dataset Selection with Datamodels

When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…

Machine Learning · Computer Science 2024-01-24 Logan Engstrom, Axel Feldmann, Aleksander Madry

Improving Pretraining Data Using Perplexity Correlations

Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present…

Computation and Language · Computer Science 2025-03-11 Tristan Thrush, Christopher Potts, Tatsunori Hashimoto

SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training

The effectiveness of large language models (LLMs) is often hindered by duplicated data in their extensive pre-training datasets. Current approaches primarily focus on detecting and removing duplicates, which risks the loss of valuable…

Computation and Language · Computer Science 2024-07-10 Nan He, Weichen Xiong, Hanwen Liu, Yi Liao, Lei Ding, Kai Zhang, Guohua Tang, Xiao Han, Wei Yang

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

Data curation has shifted the quality-compute frontier for language-model and contrastive image-text pretraining, but its role for vision-language models (VLMs) is far less established. We ask how far data curation alone can take VLM…

Machine Learning · Computer Science 2026-05-14 DatologyAI: Siddharth Joshi, Haoli Yin, Rishabh Adiga, Haakon Mongstad, Alvin Deng, Aldo Carranza, Alex Fang, Amro Abbas, Anshuman Suri, Brett Larsen, Daniel Zayas, Darren Teh, David Schwab, Diego Kiner, Fan Pan, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Maximilian Böther, Parth Doshi, Paul Burstein, Pratyush Maini, Ties Robroek, Tony Jiang, Vidhi Jain, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs

This work focuses on leveraging and selecting from vast, unlabeled, open data to pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain-specific data for subsequent fine-tuning while achieving desired…

Machine Learning · Computer Science 2024-05-07 Feiyang Kang, Hoang Anh Just, Yifan Sun, Himanshu Jahagirdar, Yuanzhi Zhang, Rongxing Du, Anit Kumar Sahu, Ruoxi Jia