On the Optimizer Dependence of Neural Scaling Laws
Abstract
The scaling exponent in neural scaling laws is commonly treated as a fixed constant set by architecture and data. We present evidence that this exponent depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure the exponent across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (a larger exponent), with the exponent shift increasing across most of the tested spectral range, peaking near , and remaining large at . At (characteristic of natural language), the full natural gradient achieves versus for gradient descent -- a larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic that predicts when advanced optimizers will pay off.
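As a rough illustration of what a "fitted exponent" means here, the sketch below assumes a pure power law, loss = c * N^(-alpha), and recovers alpha by least squares in log-log space. The function names, the symbols `alpha` and `c`, and the synthetic exponents 0.5 and 0.7 are hypothetical choices for this example; they are not values or code from the paper, and the paper's own fitting procedure may differ.

```python
import math


def fit_scaling_exponent(sizes, losses):
    """Fit log(loss) = log(c) - alpha * log(size) by least squares.

    Returns (alpha, c). Assumes a pure power law with no irreducible-loss term.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(v) for v in losses]
    count = len(xs)
    x_mean = sum(xs) / count
    y_mean = sum(ys) / count
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return -slope, math.exp(intercept)


def ratio_change_over_doublings(delta_alpha, doublings):
    """Factor by which the loss ratio (steeper curve / shallower curve)
    shrinks over `doublings` doublings of model size."""
    return 2.0 ** (-delta_alpha * doublings)


# Synthetic, noise-free losses; the exponents 0.5 and 0.7 are made up.
sizes = [2 ** k for k in range(6, 12)]
losses_shallow = [3.0 * n ** -0.5 for n in sizes]
losses_steep = [3.0 * n ** -0.7 for n in sizes]

alpha_shallow, c_shallow = fit_scaling_exponent(sizes, losses_shallow)
alpha_steep, c_steep = fit_scaling_exponent(sizes, losses_steep)
delta_alpha = alpha_steep - alpha_shallow
```

Under this assumed form, an exponent gap of `delta_alpha` multiplies the loss ratio between the two curves by 2^(-delta_alpha) at every doubling of model size, which is the sense in which a larger exponent compounds. Real loss curves are noisy and often include an irreducible-loss offset, so a fit of this kind is only a first approximation.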
Cite
@article{arxiv.2605.29387,
title = {On the Optimizer Dependence of Neural Scaling Laws},
author = {Vansh Ramani and Shourya Vir Jain},
journal = {arXiv preprint arXiv:2605.29387},
year = {2026}
}