English
Related papers

Related papers: Scale Dependent Data Duplication

200 papers

Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the…

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While…

Computation and Language · Computer Science 2023-08-24 Kushal Tirumala , Daniel Simig , Armen Aghajanyan , Ari S. Morcos

Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute…

Machine Learning · Computer Science 2025-01-03 Bradley Brown , Jordan Juravsky , Ryan Ehrlich , Ronald Clark , Quoc V. Le , Christopher Ré , Azalia Mirhoseini

As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable…

Machine Learning · Computer Science 2026-05-18 Anastasiia Sedova , Skyler Seto , Natalie Schluter , Pierre Ablin

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the…

Computation and Language · Computer Science 2022-03-28 Katherine Lee , Daphne Ippolito , Andrew Nystrom , Chiyuan Zhang , Douglas Eck , Chris Callison-Burch , Nicholas Carlini

As neural networks continue to grow in size but datasets might not, it is vital to understand how much performance improvement can be expected: is it more important to scale network size or data volume? Thus, neural network scaling laws,…

Machine Learning · Computer Science 2024-09-10 Akhilan Boopathy , Ila Fiete

We introduce a framework for optimizing domain-specific dataset construction in foundation model training. Specifically, we seek a cost-efficient way to estimate the quality of data sources (e.g. synthetically generated or filtered web…

The effectiveness of large language models (LLMs) is often hindered by duplicated data in their extensive pre-training datasets. Current approaches primarily focus on detecting and removing duplicates, which risks the loss of valuable…

Computation and Language · Computer Science 2024-07-10 Nan He , Weichen Xiong , Hanwen Liu , Yi Liao , Lei Ding , Kai Zhang , Guohua Tang , Xiao Han , Wei Yang

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are…

Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has primarily focused on studying scaling laws for pretraining (upstream) loss. However, in transfer learning settings, in…

Computation and Language · Computer Science 2026-01-30 Berivan Isik , Natalia Ponomareva , Hussein Hazimeh , Dimitris Paparas , Sergei Vassilvitskii , Sanmi Koyejo

In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and…

Machine Learning · Computer Science 2025-06-06 Marianna Nezhurina , Tomer Porian , Giovanni Pucceti , Tommie Kerssies , Romain Beaumont , Mehdi Cherti , Jenia Jitsev

We study empirical scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting. When we train increasingly large neural networks from-scratch on a fixed-size dataset, they eventually become data-limited…

Machine Learning · Computer Science 2021-02-03 Danny Hernandez , Jared Kaplan , Tom Henighan , Sam McCandlish

Reusing pretrained base models for further pretraining, such as continual pretraining or model growth, is promising at reducing the cost of training language models from scratch. However, the effectiveness remains unclear, especially when…

Computation and Language · Computer Science 2026-02-04 Seng Pei Liew , Takuya Kato

Large language models achieve high performance on many but not all downstream tasks. The interaction between pretraining data and task data is commonly assumed to determine this variance: a task with data that is more similar to a model's…

Computation and Language · Computer Science 2023-11-16 Gregory Yauney , Emily Reif , David Mimno

Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its…

Machine Learning · Computer Science 2023-10-10 Fuzhao Xue , Yao Fu , Wangchunshu Zhou , Zangwei Zheng , Yang You

Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance…

Machine Learning · Computer Science 2025-07-16 Zhengyu Chen , Siqi Wang , Teng Xiao , Yudong Wang , Shiqi Chen , Xunliang Cai , Junxian He , Jingang Wang

Progress in machine learning has been driven in large part by massive increases in data. However, large web-scale datasets such as LAION are largely uncurated beyond searches for exact duplicates, potentially leaving much redundancy. Here,…

Machine Learning · Computer Science 2023-03-23 Amro Abbas , Kushal Tirumala , Dániel Simig , Surya Ganguli , Ari S. Morcos

How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques -- weight pruning and simply training a smaller or larger model, which we refer to as dense…

Computation and Language · Computer Science 2023-10-10 Tian Jin , Nolan Clement , Xin Dong , Vaishnavh Nagarajan , Michael Carbin , Jonathan Ragan-Kelley , Gintare Karolina Dziugaite

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven…

Training data plays a crucial role in Large Language Models (LLM) scaling, yet high quality data is of limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale…

‹ Prev 1 2 3 10 Next ›