Related papers: Improving Pretraining Data Using Perplexity Correl…

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web…

Computation and Language · Computer Science · 2023-09-12 · Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker

DataMan: Data Manager for Pre-training Large Language Models

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking…

Computation and Language · Computer Science · 2025-04-09 · Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging

Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive…

Computation and Language · Computer Science · 2024-12-10 · Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi

Data Mixing for Large Language Models Pretraining: A Survey and Outlook

Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget…

Computation and Language · Computer Science · 2026-04-21 · Zhuo Chen, Yuxuan Miao, Supryadi, Deyi Xiong

Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the…

Machine Learning · Computer Science · 2024-06-03 · Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul

How to Train Data-Efficient LLMs

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data…

Machine Learning · Computer Science · 2024-02-16 · Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng

Efficient Response Generation Strategy Selection for Fine-Tuning Large Language Models Through Self-Aligned Perplexity

Fine-tuning large language models (LLMs) typically relies on producing large sets of input-output pairs. Yet for a given question, there can be many valid outputs. In practice, these outputs are often derived by distilling knowledge from…

Computation and Language · Computer Science · 2025-08-28 · Xuan Ren, Qi Chen, Lingqiao Liu

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While…

Computation and Language · Computer Science · 2023-08-24 · Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos

Beyond Repetition: Text Simplification and Curriculum Learning for Data-Constrained Pretraining

Most studies on language model pretraining focus on large datasets, leaving open questions about optimization in data-constrained settings. In such settings, the effects of training data order and of including alternative versions of the…

Computation and Language · Computer Science · 2025-09-30 · Matthew Theodore Roque, Dan John Velasco

Language Models Improve When Pretraining Data Matches Target Tasks

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine…

Computation and Language · Computer Science · 2025-07-17 · David Mizrahi, Anders Boesen Lindbo Larsen, Jesse Allardice, Suzie Petryk, Yuri Gorokhov, Jeffrey Li, Alex Fang, Josh Gardner, Tom Gunter, Afshin Dehghan

Can Pre-training Indicators Reliably Predict Fine-tuning Outcomes of LLMs?

While metrics available during pre-training, such as perplexity, correlate well with model performance at scaling-laws studies, their predictive capacities at a fixed model size remain unclear, hindering effective model selection and…

Computation and Language · Computer Science · 2025-10-17 · Hansi Zeng, Kai Hui, Honglei Zhuang, Zhen Qin, Zhenrui Yue, Hamed Zamani, Dana Alon

Perplexity-Aware Data Scaling Law: Perplexity Landscapes Predict Performance for Continual Pre-training

Continual Pre-training (CPT) serves as a fundamental approach for adapting foundation models to domain-specific applications. Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM.…

Machine Learning · Computer Science · 2025-12-29 · Lei Liu, Hao Zhu, Yue Shen, Zhixuan Chu, Jian Wang, Jinjie Gu, Kui Ren

Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity

As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers…

Computation and Language · Computer Science · 2026-03-04 · Yeongbin Seo, Gayoung Kim, Jaehyung Kim, Jinyoung Yeo

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is…

Machine Learning · Computer Science · 2026-05-15 · Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen

Low-Perplexity LLM-Generated Sequences and Where To Find Them

As Large Language Models (LLMs) become increasingly widespread, understanding how specific training data shapes their outputs is crucial for transparency, accountability, privacy, and fairness. To explore how LLMs leverage and replicate…

Computation and Language · Computer Science · 2025-07-03 · Arthur Wuhrmann, Anastasiia Kucherenko, Andrei Kucharavy

Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training

Recently published work on rephrasing natural text data for pre-training LLMs has shown promising results when combining the original dataset with the synthetically rephrased data. We build upon previous work by replicating existing results…

Computation and Language · Computer Science · 2024-10-29 · Michael Pieler, Marco Bellagente, Hannah Teufel, Duy Phung, Nathan Cooper, Jonathan Tow, Paulo Rocha, Reshinth Adithyan, Zaid Alyafeai, Nikhil Pinnaparaju, Maksym Zhuravinskyi, Carlos Riquelme

Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models

Long-context modeling capabilities are important for large language models (LLMs) in various applications. However, directly training LLMs with long context windows is insufficient to enhance this capability since some training samples do…

Computation and Language · Computer Science · 2024-05-29 · Longze Chen, Ziqiang Liu, Wanwei He, Yunshui Li, Run Luo, Min Yang

Predictive Data Selection: The Data That Predicts Is the Data That Teaches

Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient…

Computation and Language · Computer Science · 2025-08-05 · Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He

BiMix: A Bivariate Data Mixing Law for Language Model Pretraining

Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly…

Machine Learning · Computer Science · 2025-01-28 · Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

Adaptation Odyssey in LLMs: Why Does Additional Pretraining Sometimes Fail to Improve?

In the last decade, the generalization and adaptation abilities of deep learning models were typically evaluated on fixed training and test distributions. Contrary to traditional deep learning, large language models (LLMs) are (i) even more…

Computation and Language · Computer Science · 2024-10-17 · Fırat Öncel, Matthias Bethge, Beyza Ermis, Mirco Ravanelli, Cem Subakan, Çağatay Yıldız