Related papers: Why Does Multi-Epoch Training Help?

Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime

Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization. Most existing generalization analyses are made for single-pass SGD, which is a less practical variant…

Machine Learning · Computer Science 2022-03-08 Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Sham M. Kakade

Benign Underfitting of Stochastic Gradient Descent

We study to what extent may stochastic gradient descent (SGD) be understood as a "conventional" learning rule that achieves generalization performance by obtaining a good fit to training data. We consider the fundamental stochastic convex…

Machine Learning · Computer Science 2023-01-13 Tomer Koren, Roi Livni, Yishay Mansour, Uri Sherman

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on…

Machine Learning · Computer Science 2018-11-26 Loucas Pillaud-Vivien, Alessandro Rudi, Francis Bach

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions

Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency…

Machine Learning · Statistics 2022-06-16 Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington

Escaping Saddle Points Faster with Stochastic Momentum

Stochastic gradient descent (SGD) with stochastic momentum is popular in nonconvex stochastic optimization and particularly for the training of deep neural networks. In standard SGD, parameters are updated by improving along the path of the…

Machine Learning · Computer Science 2021-06-08 Jun-Kun Wang, Chi-Heng Lin, Jacob Abernethy

Reinforced stochastic gradient descent for deep neural network learning

Stochastic gradient descent (SGD) is a standard optimization method to minimize a training error with respect to network parameters in modern neural network learning. However, it typically suffers from proliferation of saddle points in the…

Machine Learning · Computer Science 2017-11-23 Haiping Huang, Taro Toyoizumi

Rapid Overfitting of Multi-Pass Stochastic Gradient Descent in Stochastic Convex Optimization

We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $\Theta(1/\sqrt{n})$ excess population…

Machine Learning · Computer Science 2025-05-16 Shira Vansover-Hager, Tomer Koren, Roi Livni

SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs

Multi-epoch, small-batch, Stochastic Gradient Descent (SGD) has been the method of choice for learning with large over-parameterized models. A popular theory for explaining why SGD works well in practice is that the algorithm has an…

Machine Learning · Computer Science 2021-07-13 Satyen Kale, Ayush Sekhari, Karthik Sridharan

On the Convergence of SGD Training of Neural Networks

Neural networks are usually trained by some form of stochastic gradient descent (SGD). A number of strategies are in common use intended to improve SGD optimization, such as learning rate schedules, momentum, and batching. These are…

Neural and Evolutionary Computing · Computer Science 2015-08-13 Thomas M. Breuel

Stagewise Training Accelerates Convergence of Testing Error Over SGD

Stagewise training strategy is widely used for learning neural networks, which runs a stochastic algorithm (e.g., SGD) starting with a relatively large step size (aka learning rate) and geometrically decreasing the step size after a number…

Machine Learning · Statistics 2019-02-05 Zhuoning Yuan, Yan Yan, Rong Jin, Tianbao Yang

On the Convergence of Stochastic Gradient Descent with Perturbed Forward-Backward Passes

We study stochastic gradient descent (SGD) for composite optimization problems with $N$ sequential operators subject to perturbations in both the forward and backward passes. Unlike classical analyses that treat gradient noise as additive…

Optimization and Control · Mathematics 2026-02-25 Boao Kong, Hengrui Zhang, Kun Yuan

Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity

Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear…

Machine Learning · Computer Science 2021-12-08 Scott Pesme, Loucas Pillaud-Vivien, Nicolas Flammarion

Full-Batch Gradient Descent Outperforms One-Pass SGD: Sample Complexity Separation in Single-Index Learning

It is folklore that reusing training data more than once can improve the statistical efficiency of gradient-based learning. However, beyond linear regression, the theoretical advantage of full-batch gradient descent (GD, which always reuses…

Machine Learning · Statistics 2026-02-03 Filip Kovačević, Hong Chang Ji, Denny Wu, Mahdi Soltanolkotabi, Marco Mondelli

No More Pesky Learning Rates

The performance of stochastic gradient descent (SGD) depends critically on how learning rates are tuned and decreased over time. We propose a method to automatically adjust multiple learning rates so as to minimize the expected error at any…

Machine Learning · Statistics 2013-02-19 Tom Schaul, Sixin Zhang, Yann LeCun

Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes

In this paper we analyze the behaviour of the stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights via a minimization of non-convex loss functions. Since the pioneering work of…

Machine Learning · Computer Science 2025-05-13 Davide Barbieri, Matteo Bonforte, Peio Ibarrondo

Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent

For large scale learning problems, it is desirable if we can obtain the optimal model parameters by going through the data in only one pass. Polyak and Juditsky (1992) showed that asymptotically the test performance of the simple average of…

Machine Learning · Computer Science 2011-12-23 Wei Xu

Masked Training of Neural Networks with Partial Gradients

State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD). Recently, many variations have been explored: perturbing parameters for better accuracy (such as in Extragradient), limiting SGD…

Machine Learning · Computer Science 2022-03-23 Amirkeivan Mohtashami, Martin Jaggi, Sebastian U. Stich

When Does Stochastic Gradient Algorithm Work Well?

In this paper, we consider a general stochastic optimization problem which is often at the core of supervised learning, such as deep learning and linear classification. We consider a standard stochastic gradient descent (SGD) method with a…

Machine Learning · Statistics 2018-12-27 Lam M. Nguyen, Nam H. Nguyen, Dzung T. Phan, Jayant R. Kalagnanam, Katya Scheinberg

How Good is SGD with Random Shuffling?

We study the performance of stochastic gradient descent (SGD) on smooth and strongly-convex finite-sum optimization problems. In contrast to the majority of existing theoretical works, which assume that individual functions are sampled with…

Machine Learning · Computer Science 2021-06-03 Itay Safran, Ohad Shamir

The Break-Even Point on Optimization Trajectories of Deep Neural Networks

The early phase of training of deep neural networks is critical for their final performance. In this work, we study how the hyperparameters of stochastic gradient descent (SGD) used in the early phase of training affect the rest of the…

Machine Learning · Computer Science 2020-02-25 Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho, Krzysztof Geras