Related papers: Optimization Hyper-parameter Laws for Large Langua…

Predictable Scale: Part I, Step Law -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well established, yet their effective deployment necessitates careful hyperparameter optimization. Although existing methods have explored the…

Machine Learning · Computer Science 2025-08-20 Houyi Li, Wenzhen Zheng, Qiufeng Wang, Hanshan Zhang, Zili Wang, Shijie Xuyang, Yuantao Fan, Zhenyu Ding, Haoying Wang, Ning Ding, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

A Hitchhiker's Guide to Scaling Law Estimation

Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare…

Machine Learning · Computer Science 2025-06-04 Leshem Choshen, Yang Zhang, Jacob Andreas

Scaling Law for Language Models Training Considering Batch Size

Large language models (LLMs) have made remarkable advances in recent years, with scaling laws playing a critical role in this rapid progress. In this paper, we empirically investigate how a critical hyper-parameter, i.e., the global batch…

Computation and Language · Computer Science 2024-12-03 Xian Shuai, Yiding Wang, Yimeng Wu, Xin Jiang, Xiaozhe Ren

Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs

Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance…

Machine Learning · Computer Science 2025-07-16 Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, Jingang Wang

Unraveling the Mystery of Scaling Laws: Part I

Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training. These principles play a vital role in optimizing various aspects of…

Machine Learning · Computer Science 2024-04-08 Hui Su, Zhi Tian, Xiaoyu Shen, Xunliang Cai

Language models scale reliably with over-training and on downstream tasks

Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are…

Computation and Language · Computer Science 2024-06-18 Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and…

Machine Learning · Computer Science 2026-05-21 Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

Scaling Laws for Acoustic Models

There is a recent trend in machine learning to increase model quality by growing models to sizes previously thought to be unreasonable. Recent work has shown that autoregressive generative models with cross-entropy objective functions…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-18 Jasha Droppo, Oguz Elibol

Temporal Scaling Law for Large Language Models

Recently, Large Language Models (LLMs) have been widely adopted in a wide range of tasks, leading to increasing attention towards the research on how scaling LLMs affects their performance. Existing works, termed Scaling Laws, have…

Computation and Language · Computer Science 2025-09-23 Yizhe Xiong, Xiansheng Chen, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Zhenpeng Su, Wei Huang, Jianwei Niu, Jungong Han, Guiguang Ding

Towards Optimal Learning of Language Models

This work studies the general principles of improving the learning of language models (LMs), which aims at reducing the necessary training steps for achieving superior performance. Specifically, we present a theory for the optimal learning…

Computation and Language · Computer Science 2024-03-05 Yuxian Gu, Li Dong, Yaru Hao, Qingxiu Dong, Minlie Huang, Furu Wei

Deriving Hyperparameter Scaling Laws via Modern Optimization Theory

Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often…

Machine Learning · Computer Science 2026-03-18 Egor Shulgin, Dimitri von Rütte, Tianyue H. Zhang, Niccolò Ajroldi, Bernhard Schölkopf, Antonio Orvieto

Scaling Laws for Neural Language Models

We study empirical scaling laws for language model performance on the cross-entropy loss. The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven…

Machine Learning · Computer Science 2020-01-24 Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei

Scaling Laws for Optimal Data Mixtures

Large foundation models are typically trained on data from multiple domains, with the data mixture--the proportion of each domain used--playing a critical role in model performance. The standard approach to selecting this mixture relies on…

Machine Learning · Computer Science 2025-10-03 Mustafa Shukor, Louis Bethune, Dan Busbridge, David Grangier, Enrico Fini, Alaaeldin El-Nouby, Pierre Ablin

Performance Law of Large Language Models

Guided by the belief of the scaling law, large language models (LLMs) have achieved impressive performance in recent years. However, scaling law only gives a qualitative estimation of loss, which is influenced by various factors such as…

Computation and Language · Computer Science 2024-09-16 Chuhan Wu, Ruiming Tang

Scaling Laws for Hyperparameter Optimization

Hyperparameter optimization is an important subfield of machine learning that focuses on tuning the hyperparameters of a chosen algorithm to achieve peak performance. Recently, there has been a stream of methods that tackle the issue of…

Machine Learning · Computer Science 2023-10-26 Arlind Kadra, Maciej Janowski, Martin Wistuba, Josif Grabocka

Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Pretraining data of large language models composes multiple domains (e.g., web texts, academic papers, codes), whose mixture proportions crucially impact the competence of outcome models. While existing endeavors rely on heuristics or…

Computation and Language · Computer Science 2025-03-21 Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, Xipeng Qiu

How to Set the Learning Rate for Large-Scale Pre-training?

Optimal configuration of the learning rate (LR) is a fundamental yet formidable challenge in large-scale pre-training. Given the stringent trade-off between training costs and model performance, the pivotal question is whether the optimal…

Artificial Intelligence · Computer Science 2026-01-09 Yunhua Zhou, Shuhao Xing, Junhao Huang, Xipeng Qiu, Qipeng Guo

InfoLaw: Information Scaling Laws for Large Language Models with Quality-Weighted Mixture Data and Repetition

Upweighting high-quality data in LLM pretraining often improves performance, but in data-limited regimes, especially under overtraining, stronger upweighting increases repetition and can degrade performance. However, standard scaling laws do…

Computation and Language · Computer Science 2026-05-05 Fengze Liu, Weidong Zhou, Binbin Liu, Ping Guo, Zijun Wang, Bingni Zhang, Yifan Zhang, Yifeng Yu, Xiaohuan Zhou, Taifeng Wang

Using Large Language Models for Hyperparameter Optimization

This paper explores the use of foundational large language models (LLMs) in hyperparameter optimization (HPO). Hyperparameters are critical in determining the effectiveness of machine learning models, yet their optimization often relies on…

Machine Learning · Computer Science 2024-11-12 Michael R. Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, Jimmy Ba

Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

Neural scaling laws define a predictable relationship between a model's parameter count and its performance after training in the form of a power law. However, most research to date has not explicitly investigated whether scaling laws can…

Computation and Language · Computer Science 2022-10-19 Maor Ivgi, Yair Carmon, Jonathan Berant