On the Optimizer Dependence of Neural Scaling Laws

Authors: Vansh Ramani, Shourya Vir Jain

Machine Learning · Artificial Intelligence · stat.ML · 2026-05 · v1

Abstract

The scaling exponent $\alpha$ in neural scaling laws $L(N) \propto N^{-\alpha}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $\alpha$ depends systematically on the optimizer. In controlled random-feature regression experiments (the canonical theoretical framework for neural scaling), we measure $\alpha$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $\alpha$), with the shift in $\alpha$ increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $\alpha \approx 0.31$ versus $\alpha \approx 0.12$ for gradient descent. This is a $2.6\times$ larger fitted exponent, and within the random-feature model its effect compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training, where recent evidence suggests the advantage may attenuate with scale, remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
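
To make the "compounds with each model-size doubling" statement concrete, the following is a minimal sketch and not the authors' code. It assumes a pure power law $L(N) = c\,N^{-\alpha}$ with the same prefactor $c$ for both optimizers (an assumption the abstract does not state). It uses the two exponents quoted above, and recovers $\alpha$ from synthetic noiseless data by a log-log least-squares fit. The function names and model sizes are hypothetical.

```python
import math

# Exponents quoted in the abstract (at s ≈ 1.0).
ALPHA_GD = 0.12   # gradient descent
ALPHA_NG = 0.31   # full natural gradient


def loss_ratio_per_doubling(alpha: float) -> float:
    """Factor by which L(N) = c * N**(-alpha) shrinks when N doubles."""
    return 2.0 ** (-alpha)


def fit_alpha(sizes, losses):
    """Estimate alpha as the negated least-squares slope of log L vs. log N."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic, noiseless losses on hypothetical model sizes (illustration only).
sizes = [2 ** k for k in range(6, 14)]
synthetic_losses = [n ** (-ALPHA_NG) for n in sizes]
recovered = fit_alpha(sizes, synthetic_losses)

# Ratio L_NG / L_GD after 10 doublings, assuming equal prefactors and
# equal losses at the starting size (an assumption, not a reported result).
gap_after_10 = 2.0 ** (-(ALPHA_NG - ALPHA_GD) * 10)

for name, alpha in [("gradient descent", ALPHA_GD), ("natural gradient", ALPHA_NG)]:
    print(f"{name}: loss x{loss_ratio_per_doubling(alpha):.3f} per doubling")
print(f"recovered alpha: {recovered:.3f}")
print(f"relative loss (NG / GD) after 10 doublings: {gap_after_10:.3f}")
```

Under these assumptions the per-doubling factor is $2^{-\alpha}$, so the gap between two optimizers widens geometrically in the number of doublings. Real fits involve noise, irreducible-loss offsets, and finite-size effects that this sketch ignores.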

Cite

@article{arxiv.2605.29387,
  title  = {On the Optimizer Dependence of Neural Scaling Laws},
  author = {Vansh Ramani and Shourya Vir Jain},
  journal= {arXiv preprint arXiv:2605.29387},
  year   = {2026}
}
