Related papers: When Does Preconditioning Help or Hurt Generalizat…

200 papers

The capacity of deep learning models is often large enough to both learn the underlying statistical signal and overfit to noise in the training set. This noise memorization can be harmful especially for data with a low signal-to-noise ratio…

Machine Learning · Computer Science 2025-10-21 Wei Huang, Andi Han, Yujin Song, Yilan Chen, Denny Wu, Difan Zou, Taiji Suzuki

In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training…

Machine Learning · Computer Science 2021-12-07 Alex Damian, Tengyu Ma, Jason D. Lee

Natural policy gradient methods are popular reinforcement learning methods that improve the stability of policy gradient methods by utilizing second-order approximations to precondition the gradient with the inverse of the…

Machine Learning · Computer Science 2022-10-12 Brennan Gebotys, Alexander Wong, David A. Clausi

Diagonal preconditioners are computationally feasible approximations to second-order optimizers, which have shown significant promise in accelerating training of deep learning models. Two predominant approaches are based on Adam and…

Machine Learning · Computer Science 2025-10-16 Bingbin Liu, Rachit Bansal, Depen Morwani, Nikhil Vyas, David Alvarez-Melis, Sham M. Kakade

Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD). The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the…

Machine Learning · Computer Science 2021-02-24 Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas

Previous work has examined the ability of larger capacity neural networks to generalize better than smaller ones, even without explicit regularizers, by analyzing gradient based algorithms such as GD and SGD. The presence of noise and its…

Machine Learning · Computer Science 2020-05-27 Arushi Gupta

Recently, significant progress has been made in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on…

Machine Learning · Computer Science 2025-07-22 Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, Ding-Xuan Zhou

Over-parameterized deep neural networks trained by simple first-order methods are known to be able to fit any labeling of data. Such over-fitting ability hinders generalization when mislabeled training examples are present. On the other…

Machine Learning · Computer Science 2020-10-06 Wei Hu, Zhiyuan Li, Dingli Yu

Regularization is essential for avoiding over-fitting to training data in network optimization, leading to better generalization of the trained networks. The label noise provides a strong implicit regularization by replacing the target…

Machine Learning · Computer Science 2022-05-04 Kensuke Nakamura, Bong-Soo Sohn, Kyoung-Jae Won, Byung-Woo Hong

Neural networks typically generalize well when fitting the data perfectly, even though they are heavily overparameterized. Many factors have been pointed out as the reason for this phenomenon, including an implicit bias of stochastic…

Machine Learning · Computer Science 2025-02-04 Amit Peleg, Matthias Hein

Spectral bias, the tendency of neural networks to learn low frequencies first, can be both a blessing and a curse. While it enhances the generalization capabilities by suppressing high-frequency noise, it can be a limitation in scientific…

Machine Learning · Computer Science 2026-05-08 Shuai Jiang, Alexey Voronin, Eric Cyr, Ben Southworth

This work studies the global convergence and implicit bias of the Gauss-Newton method (GN) when optimizing over-parameterized one-hidden layer networks in the mean-field regime. We first establish a global convergence result for GN in the…

Machine Learning · Computer Science 2023-12-13 Michael Arbel, Romain Menegaux, Pierre Wolinski

The generalization mystery in deep learning is the following: Why do over-parameterized neural networks trained with gradient descent (GD) generalize well on real datasets even though they are capable of fitting random datasets of…

Machine Learning · Computer Science 2022-06-07 Satrajit Chatterjee, Piotr Zielinski

Giving up and starting over may seem wasteful in many situations such as searching for a target or training deep neural networks (DNNs). Our study, though, demonstrates that resetting from a checkpoint can significantly improve…

Machine Learning · Computer Science 2025-03-14 Youngkyoung Bae, Yeongwoo Song, Hawoong Jeong

In this paper we investigate the generalization error of gradient descent (GD) applied to an $\ell_2$-regularized OLS objective function in the linear model. Based on our analysis we develop new methodology for computationally tractable and…

Statistics Theory · Mathematics 2026-01-27 Thomas Stark, Lukas Steinberger

A widely believed explanation for the remarkable generalization capacities of overparameterized neural networks is that the optimization algorithms used for training induce an implicit bias towards benign solutions. To grasp this…

Machine Learning · Computer Science 2025-12-19 Maria Matveev, Vit Fojtik, Hung-Hsu Chou, Gitta Kutyniok, Johannes Maly

We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD) in the fundamental stochastic convex optimization model. While for SGD it is well-known that…

Machine Learning · Computer Science 2021-07-01 Idan Amir, Tomer Koren, Roi Livni

Preconditioning is widely used in machine learning to accelerate convergence on the empirical risk, yet its role on the expected risk remains underexplored. In this work, we investigate how preconditioning affects feature learning and…

Machine Learning · Computer Science 2025-10-01 Kotaro Yoshida, Atsushi Nitanda

Modern deep learning systems do not generalize well when the test data distribution is slightly different to the training data distribution. While much promising work has been accomplished to address this fragility, a systematic study of…

We study trade-offs between the population risk curvature, geometry of the noise, and preconditioning on the generalisation ability of the multipass Preconditioned Stochastic Gradient Descent (PSGD). Many practical optimisation heuristics…

Machine Learning · Computer Science 2026-03-13 Simon Vary, Tyler Farghly, Ilja Kuzborskij, Patrick Rebeschini