Related papers: When Does Preconditioning Help or Hurt Generalizat…

How Does Label Noise Gradient Descent Improve Generalization in the Low SNR Regime?

The capacity of deep learning models is often large enough to both learn the underlying statistical signal and overfit to noise in the training set. This noise memorization can be harmful especially for data with a low signal-to-noise ratio…

Machine Learning · Computer Science 2025-10-21 Wei Huang, Andi Han, Yujin Song, Yilan Chen, Denny Wu, Difan Zou, Taiji Suzuki

Label Noise SGD Provably Prefers Flat Global Minimizers

In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training…

Machine Learning · Computer Science 2021-12-07 Alex Damian, Tengyu Ma, Jason D. Lee

Understanding the Effects of Second-Order Approximations in Natural Policy Gradient Reinforcement Learning

Natural policy gradient methods are popular reinforcement learning methods that improve the stability of policy gradient methods by utilizing second-order approximations to precondition the gradient with the inverse of the…

Machine Learning · Computer Science 2022-10-12 Brennan Gebotys, Alexander Wong, David A. Clausi

Adam or Gauss-Newton? A Comparative Study In Terms of Basis Alignment and SGD Noise

Diagonal preconditioners are computationally feasible approximations to second-order optimizers, which have shown significant promise in accelerating training of deep learning models. Two predominant approaches are based on Adam and…

Machine Learning · Computer Science 2025-10-16 Bingbin Liu, Rachit Bansal, Depen Morwani, Nikhil Vyas, David Alvarez-Melis, Sham M. Kakade

Bad Global Minima Exist and SGD Can Reach Them

Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD). The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the…

Machine Learning · Computer Science 2021-02-24 Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas

Inherent Noise in Gradient Based Methods

Previous work has examined the ability of larger capacity neural networks to generalize better than smaller ones, even without explicit regularizers, by analyzing gradient based algorithms such as GD and SGD. The presence of noise and its…

Machine Learning · Computer Science 2020-05-27 Arushi Gupta

Generalization Guarantees of Gradient Descent for Multi-Layer Neural Networks

Recently, significant progress has been made in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on…

Machine Learning · Computer Science 2025-07-22 Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, Ding-Xuan Zhou

Simple and Effective Regularization Methods for Training on Noisily Labeled Data with Generalization Guarantee

Over-parameterized deep neural networks trained by simple first-order methods are known to be able to fit any labeling of data. Such over-fitting ability hinders generalization when mislabeled training examples are present. On the other…

Machine Learning · Computer Science 2020-10-06 Wei Hu, Zhiyuan Li, Dingli Yu

Regularization in network optimization via trimmed stochastic gradient descent with noisy label

Regularization is essential for avoiding over-fitting to training data in network optimization, leading to better generalization of the trained networks. The label noise provides a strong implicit regularization by replacing the target…

Machine Learning · Computer Science 2022-05-04 Kensuke Nakamura, Bong-Soo Sohn, Kyoung-Jae Won, Byung-Woo Hong

Bias of Stochastic Gradient Descent or the Architecture: Disentangling the Effects of Overparameterization of Neural Networks

Neural networks typically generalize well when fitting the data perfectly, even though they are heavily overparameterized. Many factors have been pointed out as the reason for this phenomenon, including an implicit bias of stochastic…

Machine Learning · Computer Science 2025-02-04 Amit Peleg, Matthias Hein

On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime

Spectral bias, the tendency of neural networks to learn low frequencies first, can be both a blessing and a curse. While it enhances the generalization capabilities by suppressing high-frequency noise, it can be a limitation in scientific…

Machine Learning · Computer Science 2026-05-08 Shuai Jiang, Alexey Voronin, Eric Cyr, Ben Southworth

Rethinking Gauss-Newton for learning over-parameterized models

This work studies the global convergence and implicit bias of the Gauss-Newton (GN) method when optimizing over-parameterized one-hidden-layer networks in the mean-field regime. We first establish a global convergence result for GN in the…

Machine Learning · Computer Science 2023-12-13 Michael Arbel, Romain Menegaux, Pierre Wolinski

On the Generalization Mystery in Deep Learning

The generalization mystery in deep learning is the following: Why do over-parameterized neural networks trained with gradient descent (GD) generalize well on real datasets even though they are capable of fitting random datasets of…

Machine Learning · Computer Science 2022-06-07 Satrajit Chatterjee, Piotr Zielinski

Stochastic Resetting Mitigates Latent Gradient Bias of SGD from Label Noise

Giving up and starting over may seem wasteful in many situations such as searching for a target or training deep neural networks (DNNs). Our study, though, demonstrates that resetting from a checkpoint can significantly improve…

Machine Learning · Computer Science 2025-03-14 Youngkyoung Bae, Yeongwoo Song, Hawoong Jeong

Implicit vs. explicit regularization for high-dimensional gradient descent

In this paper we investigate the generalization error of gradient descent (GD) applied to an $\ell_2$-regularized OLS objective function in the linear model. Based on our analysis we develop new methodology for computationally tractable and…

Statistics Theory · Mathematics 2026-01-27 Thomas Stark, Lukas Steinberger

Conflicting Biases at the Edge of Stability: Norm versus Sharpness Regularization

A widely believed explanation for the remarkable generalization capacities of overparameterized neural networks is that the optimization algorithms used for training induce an implicit bias towards benign solutions. To grasp this…

Machine Learning · Computer Science 2025-12-19 Maria Matveev, Vit Fojtik, Hung-Hsu Chou, Gitta Kutyniok, Johannes Maly

SGD Generalizes Better Than GD (And Regularization Doesn't Help)

We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD) in the fundamental stochastic convex optimization model. While for SGD it is well-known that…

Machine Learning · Computer Science 2021-07-01 Idan Amir, Tomer Koren, Roi Livni

How Does Preconditioning Guide Feature Learning in Deep Neural Networks?

Preconditioning is widely used in machine learning to accelerate convergence on the empirical risk, yet its role on the expected risk remains underexplored. In this work, we investigate how preconditioning affects feature learning and…

Machine Learning · Computer Science 2025-10-01 Kotaro Yoshida, Atsushi Nitanda

Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Modern deep learning systems do not generalize well when the test data distribution is slightly different to the training data distribution. While much promising work has been accomplished to address this fragility, a systematic study of…

Machine Learning · Computer Science 2023-06-07 Hiroki Naganuma, Kartik Ahuja, Shiro Takagi, Tetsuya Motokawa, Rio Yokota, Kohta Ishikawa, Ikuro Sato, Ioannis Mitliagkas

On-Average Stability of Multipass Preconditioned SGD and Effective Dimension

We study trade-offs between the population risk curvature, geometry of the noise, and preconditioning on the generalisation ability of the multipass Preconditioned Stochastic Gradient Descent (PSGD). Many practical optimisation heuristics…

Machine Learning · Computer Science 2026-03-13 Simon Vary, Tyler Farghly, Ilja Kuzborskij, Patrick Rebeschini