Related papers: Strong Model Collapse

A Tale of Tails: Model Collapse as a Change of Scaling Laws

As AI model size grows, neural scaling laws have become a crucial tool to predict the improvements of large models when increasing capacity and the size of original (human or natural) training data. Yet, the widespread use of popular models…

Machine Learning · Computer Science 2024-06-03 Elvis Dohmatob, Yunzhen Feng, Pu Yang, Francois Charton, Julia Kempe

Escaping Collapse: The Strength of Weak Data for Large Language Model Training

Synthetically generated data plays an increasingly large role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper curation it can cause LLM performance…

Machine Learning · Computer Science 2025-12-02 Kareem Amin, Sara Babakniya, Alex Bie, Weiwei Kong, Umar Syed, Sergei Vassilvitskii

A Solvable Model of Neural Scaling Laws

Large language models with a huge number of parameters, when trained on near internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in…

Machine Learning · Computer Science 2022-11-01 Alexander Maloney, Daniel A. Roberts, James Sully

Is the Number of Trainable Parameters All That Actually Matters?

Recent work has identified simple empirical scaling laws for language models, linking compute budget, dataset size, model size, and autoregressive modeling loss. The validity of these simple power laws across orders of magnitude in model…

Machine Learning · Statistics 2021-09-27 Amélie Chatelain, Amine Djeghri, Daniel Hesslow, Julien Launay, Iacopo Poli

Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification

Large Language Models (LLMs) are increasingly trained on data generated by other LLMs, either because generated text and images become part of the pre-training corpus, or because synthesized data is used as a replacement for expensive…

Machine Learning · Computer Science 2024-10-28 Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, Julia Kempe

How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse

The phenomenon of model collapse, introduced in (Shumailov et al., 2023), refers to the deterioration in performance that occurs when new models are trained on synthetic data generated from previously trained models. This recursive training…

Machine Learning · Computer Science 2024-04-09 Mohamed El Amine Seddik, Suei-Wen Chen, Soufiane Hayou, Pierre Youssef, Merouane Debbah

Model Collapse Demystified: The Case of Regression

In the era of proliferation of large language and image generation models, the phenomenon of "model collapse" refers to the situation whereby as a model is trained recursively on data generated from previous generations of itself over time,…

Machine Learning · Computer Science 2024-05-02 Elvis Dohmatob, Yunzhen Feng, Julia Kempe

Experimental evidence of progressive ChatGPT models self-convergence

Large Language Models (LLMs) that undergo recursive training on synthetically generated data are susceptible to model collapse, a phenomenon marked by the generation of meaningless output. Existing research has examined this issue from…

Computation and Language · Computer Science 2026-03-17 Konstantinos F. Xylogiannopoulos, Petros Xanthopoulos, Panagiotis Karampelas, Georgios A. Bakamitsos

A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops

High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Consequently, models increasingly generate their own data for further training, forming…

Machine Learning · Computer Science 2025-02-27 Shi Fu, Yingjie Wang, Yuzhu Chen, Xinmei Tian, Dacheng Tao

Scaling Laws in Linear Regression: Compute, Parameters, and Data

Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists…

Machine Learning · Computer Science 2025-06-11 Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, Jason D. Lee

Scaling Laws of Synthetic Data for Language Models

Large language models (LLMs) achieve strong performance across diverse tasks, largely driven by high-quality web data used in pre-training. However, recent studies indicate this data source is rapidly depleting. Synthetic data emerges as a…

Computation and Language · Computer Science 2025-10-07 Zeyu Qin, Qingxiu Dong, Xingxing Zhang, Li Dong, Xiaolong Huang, Ziyi Yang, Mahmoud Khademi, Dongdong Zhang, Hany Hassan Awadalla, Yi R. Fung, Weizhu Chen, Minhao Cheng, Furu Wei

Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs

Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance…

Machine Learning · Computer Science 2025-07-16 Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang

Scaling Laws for Neural Language Models

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven…

Machine Learning · Computer Science 2020-01-24 Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and…

Machine Learning · Computer Science 2026-05-21 Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

Scaling Laws and Interpretability of Learning from Repeated Data

Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the…

Machine Learning · Computer Science 2022-05-24 Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei, Nicholas Joseph, Jared Kaplan, Sam McCandlish

Scaling Law for Language Models Training Considering Batch Size

Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch…

Computation and Language · Computer Science 2024-12-03 Xian Shuai, Yiding Wang, Yimeng Wu, Xin Jiang, Xiaozhe Ren

Bias Amplification: Large Language Models as Increasingly Biased Media

Model collapse, a phenomenon characterized by performance degradation due to iterative training on synthetic data, has been widely studied. However, its implications for bias amplification, the progressive intensification of pre-existing…

Artificial Intelligence · Computer Science 2025-05-23 Ze Wang, Zekun Wu, Jeremy Zhang, Xin Guan, Navya Jain, Skylar Lu, Saloni Gupta, Adriano Koshiyama

A Probabilistic Perspective on Model Collapse

In recent years, model collapse has become a critical issue in language model training, making it essential to understand the underlying mechanisms driving this phenomenon. In this paper, we investigate recursive parametric model training…

Machine Learning · Statistics 2025-05-23 Shirong Xu, Hengzhi He, Guang Cheng

Language models scale reliably with over-training and on downstream tasks

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are…

Computation and Language · Computer Science 2024-06-18 Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

Machine-generated text detection prevents language model collapse

As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. Since online data is the primary resource…

Computation and Language · Computer Science 2025-09-23 George Drayson, Emine Yilmaz, Vasileios Lampos