Related papers: Scaling Data-Constrained Language Models

Scaling Parameter-Constrained Language Models with Quality Data

Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-optimal estimates but often neglecting the impact of data quality on model generalization. In this…

Computation and Language · Computer Science · 2024-10-07 · Ernie Chang, Matteo Paltenghi, Yang Li, Pin-Jie Lin, Changsheng Zhao, Patrick Huber, Zechun Liu, Rastislav Rabatin, Yangyang Shi, Vikas Chandra

Language models scale reliably with over-training and on downstream tasks

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are…

Computation and Language · Computer Science · 2024-06-18 · Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

Scaling Laws for Mixture Pretraining Under Data Constraints

As language models scale, the amount of data they require grows -- yet many target data sources, such as low-resource languages or specialized domains, are inherently limited in size. A common strategy is to mix this scarce but valuable…

Machine Learning · Computer Science · 2026-05-18 · Anastasiia Sedova, Skyler Seto, Natalie Schluter, Pierre Ablin

Scaling Law for Language Models Training Considering Batch Size

Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch…

Computation and Language · Computer Science · 2024-12-03 · Xian Shuai, Yiding Wang, Yimeng Wu, Xin Jiang, Xiaozhe Ren

Scaling Laws for Neural Language Models

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven…

Machine Learning · Computer Science · 2020-01-24 · Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei

Is the Number of Trainable Parameters All That Actually Matters?

Recent work has identified simple empirical scaling laws for language models, linking compute budget, dataset size, model size, and autoregressive modeling loss. The validity of these simple power laws across orders of magnitude in model…

Machine Learning · Statistics · 2021-09-27 · Amélie Chatelain, Amine Djeghri, Daniel Hesslow, Julien Launay, Iacopo Poli

Training Compute-Optimal Protein Language Models

We explore optimally training protein language models, an area of significant interest in biological research where guidance on best practices is limited. Most models are trained with extensive compute resources until performance gains…

Machine Learning · Computer Science · 2024-11-05 · Xingyi Cheng, Bo Chen, Pan Li, Jing Gong, Jie Tang, Le Song

To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis

Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its…

Machine Learning · Computer Science · 2023-10-10 · Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, Yang You

Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation

Molecular generative models, often employing GPT-style language modeling on molecular string representations, have shown promising capabilities when scaled to large datasets and model sizes. However, it remains unclear and subject to debate…

Machine Learning · Computer Science · 2026-02-02 · Dong Xu, Qihua Pan, Sisi Yuan, Jianqiang Li, Zexuan Zhu, Junkai Ji

Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although scaling laws describe this trend at large scale, their implications in controlled,…

Machine Learning · Computer Science · 2026-04-13 · Götz-Henrik Wiegand, Lorena Raichle, Rico Städeli, Tomas Hrycej, Bernhard Bermeitinger, Siegfried Handschuh

Scaling Laws and Interpretability of Learning from Repeated Data

Recent large language models have been trained on vast datasets, but also often on repeated data, either intentionally for the purpose of upweighting higher quality data, or unintentionally because data deduplication is not perfect and the…

Machine Learning · Computer Science · 2022-05-24 · Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei, Nicholas Joseph, Jared Kaplan, Sam McCandlish

Scaling Laws for Generative Mixed-Modal Language Models

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for…

Computation and Language · Computer Science · 2023-01-11 · Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, Luke Zettlemoyer

Compute Optimal Tokenization

Scaling laws enable the optimal selection of data amount and language model size, yet the impact of the data unit, the token, on this relationship remains underexplored. In this work, we systematically investigate how the information…

Computation and Language · Computer Science · 2026-05-27 · Tomasz Limisiewicz, Artidoro Pagnoni, Srini Iyer, Mike Lewis, Sachin Mehta, Alisa Liu, Margaret Li, Gargi Ghosh, Luke Zettlemoyer

Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo

As we scale to more massive machine learning models, the frequent synchronization demands inherent in data-parallel approaches create significant slowdowns, posing a critical challenge to further scaling. Recent work develops an approach…

Machine Learning · Computer Science · 2025-03-14 · Zachary Charles, Gabriel Teston, Lucio Dery, Keith Rush, Nova Fallen, Zachary Garrett, Arthur Szlam, Arthur Douillard

A Scaling Law for Token Efficiency in LLM Fine-Tuning Under Fixed Compute Budgets

We introduce a scaling law for fine-tuning large language models (LLMs) under fixed compute budgets that explicitly accounts for data composition. Conventional approaches measure training data solely by total tokens, yet the number of…

Computation and Language · Computer Science · 2025-06-04 · Ryan Lagasse, Aidan Kierans, Avijit Ghosh, Shiri Dori-Hacohen

Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs

Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance…

Machine Learning · Computer Science · 2025-07-16 · Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang

Datasets, Documents, and Repetitions: The Practicalities of Unequal Data Quality

Data filtering has become a powerful tool for improving model performance while reducing computational cost. However, as large language model compute budgets continue to grow, the limited data volume provided by heavily filtered and…

Computation and Language · Computer Science · 2025-11-07 · Alex Fang, Hadi Pouransari, Matt Jordan, Alexander Toshev, Vaishaal Shankar, Ludwig Schmidt, Tom Gunter

A Solvable Model of Neural Scaling Laws

Large language models with a huge number of parameters, when trained on near internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in…

Machine Learning · Computer Science · 2022-11-01 · Alexander Maloney, Daniel A. Roberts, James Sully

Scaling Pre-training to One Hundred Billion Data for Vision Language Models

We provide an empirical investigation of the potential of pre-training vision-language models on an unprecedented scale: 100 billion examples. We find that model performance tends to saturate at this scale on many common Western-centric…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-12 · Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, Xiaohua Zhai

Reusing Overtrained Language Models Saturates Scaling

Reusing pretrained base models for further pretraining, such as continual pretraining or model growth, is promising at reducing the cost of training language models from scratch. However, the effectiveness remains unclear, especially when…

Computation and Language · Computer Science · 2026-02-04 · Seng Pei Liew, Takuya Kato