Optimization Hyper-parameter Laws for Large Language Models

Machine Learning 2026-05-21 v4 Optimization and Control

Abstract

Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they offer little help in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, that evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that predicts final training loss as a function of LR schedule, model size, and data size. Grounded in SDE-based convergence and escape analyses, Opt-Laws yield interpretable convergence and escape features that predict final training loss across model scales, enabling schedule pre-selection from small-scale experiments. Empirically, Opt-Laws achieve a 94% Top-2 hit rate for identifying near-optimal schedule candidates on held-out configurations, correctly identify the best-performing schedule family in all five evaluated out-of-family settings, and detect training divergence with F1 = 0.92.

Cite

@article{arxiv.2409.04777,
  title  = {Optimization Hyper-parameter Laws for Large Language Models},
  author = {Xingyu Xie and Kuangyu Ding and Shuicheng Yan and Kim-Chuan Toh and Tianwen Wei},
  journal= {arXiv preprint arXiv:2409.04777},
  year   = {2026}
}