Asynchronous Heavy-Tailed Optimization

Machine Learning 2026-02-23 v1

Abstract

Heavy-tailed stochastic gradient noise, commonly observed in transformer models, can destabilize optimization. Recent work has mainly focused on developing and understanding approaches to heavy-tailed noise in centralized or distributed synchronous settings, leaving the interaction between such noise and asynchronous optimization underexplored. In this work, we investigate two communication schemes that handle stragglers with asynchronous updates in the presence of heavy-tailed gradient noise. We propose and theoretically analyze algorithmic modifications based on delay-aware learning rate scheduling and delay compensation to improve the performance of asynchronous algorithms. Our convergence guarantees under heavy-tailed noise match the rates of the synchronous counterparts and improve delay tolerance compared with existing asynchronous approaches. Empirically, our approaches outperform prior synchronous and asynchronous methods in accuracy/runtime trade-offs and are more robust to hyperparameter choices on both image and language tasks.
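
To make the idea of delay-aware learning rate scheduling concrete, the following is a minimal illustrative sketch, not the algorithm proposed in the paper. It assumes a simple 1/(1 + delay) step-size rule and pairs it with gradient-norm clipping, a common remedy for heavy-tailed noise; the function names, the scaling rule, and the default values are all hypothetical choices made for illustration.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def delay_aware_lr(base_lr, delay):
    """Shrink the step size for staler gradients.

    The 1 / (1 + delay) rule is an assumption for illustration,
    not the schedule analyzed in the paper.
    """
    return base_lr / (1.0 + delay)


def async_step(params, stale_grad, delay, base_lr=0.1, max_norm=1.0):
    """Apply one update with a gradient computed `delay` steps ago."""
    lr = delay_aware_lr(base_lr, delay)
    clipped = clip_by_norm(stale_grad, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]


if __name__ == "__main__":
    # A large (heavy-tailed) gradient arriving one step late.
    print(async_step([0.0, 0.0], [3.0, 4.0], delay=1))
```

In this sketch, a fresh gradient (delay 0) is applied with the full base step size, while a gradient that arrives late is both clipped and down-weighted, so a single stale outlier moves the parameters less.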

Cite

@article{arxiv.2602.18002,
  title  = {Asynchronous Heavy-Tailed Optimization},
  author = {Junfei Sun and Dixi Yao and Xuchen Gong and Tahseen Rabbani and Manzil Zaheer and Tian Li},
  journal= {arXiv preprint arXiv:2602.18002},
  year   = {2026}
}

Comments

8-page main body, 25-page appendix, 5 figures