Related papers: Improving Pretraining Data Using Perplexity Correl…

200 papers

Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web…

Computation and Language · Computer Science 2023-09-12 Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking…

Computation and Language · Computer Science 2025-04-09 Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao

Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive…

Computation and Language · Computer Science 2024-12-10 Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi

Large language models (LLMs) rely on pretraining on massive and heterogeneous corpora, where training data composition has a decisive impact on training efficiency and downstream generalization under realistic compute and data budget…

Computation and Language · Computer Science 2026-04-21 Zhuo Chen, Yuxuan Miao, Supryadi, Deyi Xiong

In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the…

Machine Learning · Computer Science 2024-06-03 Zachary Ankner, Cody Blakeney, Kartik Sreenivasan, Max Marion, Matthew L. Leavitt, Mansheej Paul

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data…

Fine-tuning large language models (LLMs) typically relies on producing large sets of input-output pairs. Yet for a given question, there can be many valid outputs. In practice, these outputs are often derived by distilling knowledge from…

Computation and Language · Computer Science 2025-08-28 Xuan Ren, Qi Chen, Lingqiao Liu

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While…

Computation and Language · Computer Science 2023-08-24 Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos

Most studies on language model pretraining focus on large datasets, leaving open questions about optimization in data-constrained settings. In such settings, the effects of training data order and of including alternative versions of the…

Computation and Language · Computer Science 2025-09-30 Matthew Theodore Roque, Dan John Velasco

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine…

While metrics available during pre-training, such as perplexity, correlate well with model performance at scaling-laws studies, their predictive capacities at a fixed model size remain unclear, hindering effective model selection and…

Computation and Language · Computer Science 2025-10-17 Hansi Zeng, Kai Hui, Honglei Zhuang, Zhen Qin, Zhenrui Yue, Hamed Zamani, Dana Alon

Continual Pre-training (CPT) serves as a fundamental approach for adapting foundation models to domain-specific applications. Scaling laws for pre-training define a power-law relationship between dataset size and the test loss of an LLM.…

Machine Learning · Computer Science 2025-12-29 Lei Liu, Hao Zhu, Yue Shen, Zhixuan Chu, Jian Wang, Jinjie Gu, Kui Ren

As large language models (LLMs) are pretrained on massive web corpora, careful selection of data becomes essential to ensure effective and efficient learning. While perplexity (PPL)-based filtering has shown strong performance, it suffers…

Computation and Language · Computer Science 2026-03-04 Yeongbin Seo, Gayoung Kim, Jaehyung Kim, Jinyoung Yeo

Due to the scarcity of high-quality data, large language models (LLMs) are often trained on mixtures of data with varying quality levels, even after sophisticated data curation. A natural approach to better leverage high-quality data is…

Machine Learning · Computer Science 2026-05-15 Kairong Luo, Zhenbo Sun, Haodong Wen, Xinyu Shi, Jiarui Cui, Chenyi Dang, Kaifeng Lyu, Wenguang Chen

As Large Language Models (LLMs) become increasingly widespread, understanding how specific training data shapes their outputs is crucial for transparency, accountability, privacy, and fairness. To explore how LLMs leverage and replicate…

Computation and Language · Computer Science 2025-07-03 Arthur Wuhrmann, Anastasiia Kucherenko, Andrei Kucharavy

Recently published work on rephrasing natural text data for pre-training LLMs has shown promising results when combining the original dataset with the synthetically rephrased data. We build upon previous work by replicating existing results…

Long-context modeling capabilities are important for large language models (LLMs) in various applications. However, directly training LLMs with long context windows is insufficient to enhance this capability since some training samples do…

Computation and Language · Computer Science 2024-05-29 Longze Chen, Ziqiang Liu, Wanwei He, Yunshui Li, Run Luo, Min Yang

Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient…

Computation and Language · Computer Science 2025-08-05 Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He

Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly…

Machine Learning · Computer Science 2025-01-28 Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding

In the last decade, the generalization and adaptation abilities of deep learning models were typically evaluated on fixed training and test distributions. Contrary to traditional deep learning, large language models (LLMs) are (i) even more…

Computation and Language · Computer Science 2024-10-17 Fırat Öncel, Matthias Bethge, Beyza Ermis, Mirco Ravanelli, Cem Subakan, Çağatay Yıldız