On the Optimizer Dependence of Neural Scaling Laws

Abstract

The scaling exponent $\alpha$ in neural scaling laws $L(N) \propto N^{-\alpha}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $\alpha$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $\alpha$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $\alpha$), with the $\alpha$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $\alpha \approx 0.31$ versus $\alpha \approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
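To make the compounding claim concrete, here is a minimal sketch that assumes a pure power law $L(N) \propto N^{-\alpha}$ with no irreducible-loss term. Under that assumption each doubling of $N$ multiplies the loss by $2^{-\alpha}$, so the ratio between two optimizers' losses shrinks by $2^{-(\alpha_1 - \alpha_2)}$ per doubling. The function names and the synthetic, noise-free data are illustrative assumptions, not the paper's code or measurements. Only the two exponents are taken from the abstract.

```python
import math


def loss_ratio_per_doubling(alpha: float) -> float:
    """Factor by which L(N) = C * N**(-alpha) changes when N doubles."""
    return 2.0 ** (-alpha)


def fit_alpha(sizes, losses) -> float:
    """Estimate alpha as the negated least-squares slope of log L against log N."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return -sxy / sxx


# Exponents quoted in the abstract at s ≈ 1.0.
alpha_gd = 0.12  # gradient descent
alpha_ng = 0.31  # full natural gradient

# Synthetic, noise-free losses (an assumption for illustration only).
sizes = [2 ** k for k in range(8, 16)]
losses_ng = [n ** (-alpha_ng) for n in sizes]
recovered_alpha = fit_alpha(sizes, losses_ng)

# Relative loss of natural gradient vs. gradient descent after k doublings,
# assuming both start from the same loss at the initial model size.
for k in (1, 5, 10):
    relative = (loss_ratio_per_doubling(alpha_ng) / loss_ratio_per_doubling(alpha_gd)) ** k
    print(f"after {k:2d} doublings: L_ng / L_gd = {relative:.3f}")

print(f"recovered alpha from synthetic data: {recovered_alpha:.3f}")
```

The log-log least-squares fit is one common way to estimate a scaling exponent. The abstract does not specify the fitting procedure used in the paper, so it should be read as a stand-in.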

Cite

@article{arxiv.2605.29387,
  title  = {On the Optimizer Dependence of Neural Scaling Laws},
  author = {Vansh Ramani and Shourya Vir Jain},
  journal= {arXiv preprint arXiv:2605.29387},
  year   = {2026}
}