Related papers: Scale Dependent Data Duplication

Scaling Laws and Interpretability of Learning from Repeated Data

Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the…

Machine Learning · Computer Science 2022-05-24 Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei, Nicholas Joseph, Jared Kaplan, Sam McCandlish

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While…

Computation and Language · Computer Science 2023-08-24 Kushal Tirumala, Daniel Simig, Armen Aghajanyan, Ari S. Morcos

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute…

Machine Learning · Computer Science 2025-01-03 Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Ré, Azalia Mirhoseini

Scaling Laws for Mixture Pretraining Under Data Constraints

As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable…

Machine Learning · Computer Science 2026-05-18 Anastasiia Sedova, Skyler Seto, Natalie Schluter, Pierre Ablin

Deduplicating Training Data Makes Language Models Better

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the…

Computation and Language · Computer Science 2022-03-28 Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, Nicholas Carlini

Unified Neural Network Scaling Laws and Scale-time Equivalence

As neural networks continue to grow in size but datasets might not, it is vital to understand how much performance improvement can be expected: is it more important to scale network size or data volume? Thus, neural network scaling laws,…

Machine Learning · Computer Science 2024-09-10 Akhilan Boopathy, Ila Fiete

Using Scaling Laws for Data Source Utility Estimation in Domain-Specific Pre-Training

We introduce a framework for optimizing domain-specific dataset construction in foundation model training. Specifically, we seek a cost-efficient way to estimate the quality of data sources (e.g. synthetically generated or filtered web…

Machine Learning · Computer Science 2025-07-31 Oleksiy Ostapenko, Charles Guille-Escuret, Luke Kumar, Max Tian, Denis Kocetkov, Gopeshh Subbaraj, Raymond Li, Joel Lamy-Poirier, Sebastien Paquet, Torsten Scholak

SoftDedup: an Efficient Data Reweighting Method for Speeding Up Language Model Pre-training

The effectiveness of large language models (LLMs) is often hindered by duplicated data in their extensive pre-training datasets. Current approaches primarily focus on detecting and removing duplicates, which risks the loss of valuable…

Computation and Language · Computer Science 2024-07-10 Nan He, Weichen Xiong, Hanwen Liu, Yi Liao, Lei Ding, Kai Zhang, Guohua Tang, Xiao Han, Wei Yang

Language models scale reliably with over-training and on downstream tasks

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are…

Computation and Language · Computer Science 2024-06-18 Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

Scaling Laws for Downstream Task Performance of Large Language Models

Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in…

Computation and Language · Computer Science 2026-01-30 Berivan Isik, Natalia Ponomareva, Hussein Hazimeh, Dimitris Paparas, Sergei Vassilvitskii, Sanmi Koyejo

Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and…

Machine Learning · Computer Science 2025-06-06 Marianna Nezhurina, Tomer Porian, Giovanni Pucceti, Tommie Kerssies, Romain Beaumont, Mehdi Cherti, Jenia Jitsev

Scaling Laws for Transfer

We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited…

Machine Learning · Computer Science 2021-02-03 Danny Hernandez, Jared Kaplan, Tom Henighan, Sam McCandlish

Reusing Overtrained Language Models Saturates Scaling

Reusing pretrained base models for further pretraining, such as continual pretraining or model growth, is promising at reducing the cost of training language models from scratch. However, the effectiveness remains unclear, especially when…

Computation and Language · Computer Science 2026-02-04 Seng Pei Liew, Takuya Kato

Data Similarity is Not Enough to Explain Language Model Performance

Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model's…

Computation and Language · Computer Science 2023-11-16 Gregory Yauney, Emily Reif, David Mimno

To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis

Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its…

Machine Learning · Computer Science 2023-10-10 Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, Yang You

Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs

Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance…

Machine Learning · Computer Science 2025-07-16 Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang

SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here,…

Machine Learning · Computer Science 2023-03-23 Amro Abbas, Kushal Tirumala, Dániel Simig, Surya Ganguli, Ari S. Morcos

The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning

How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques -- weight pruning and simply training a smaller or larger model, which we refer to as dense…

Computation and Language · Computer Science 2023-10-10 Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite

Scaling Laws for Neural Language Models

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven…

Machine Learning · Computer Science 2020-01-24 Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei

Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls

Training data plays a crucial role in Large Language Models (LLM) scaling, yet high quality data is of limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale…

Machine Learning · Computer Science 2025-10-03 Feiyang Kang, Newsha Ardalani, Michael Kuchnik, Youssef Emad, Mostafa Elhoushi, Shubhabrata Sengupta, Shang-Wen Li, Ramya Raghavendra, Ruoxi Jia, Carole-Jean Wu