Related papers: Gradient Methods Never Overfit On Separable Data

Tight Risk Bounds for Gradient Descent on Separable Data

We study the generalization properties of unregularized gradient methods applied to separable linear classification -- a setting that has received considerable attention since the pioneering work of Soudry et al. (2018). We establish tight…

Machine Learning · Computer Science 2023-03-03 Matan Schliserman, Tomer Koren

Convergence of Gradient Descent on Separable Data

We provide a detailed study on the implicit bias of gradient descent when optimizing loss functions with strictly monotone tails, such as the logistic loss, over separable datasets. We look at two basic questions: (a) what are the…

Machine Learning · Statistics 2019-03-26 Mor Shpigel Nacson, Jason D. Lee, Suriya Gunasekar, Pedro H. P. Savarese, Nathan Srebro, Daniel Soudry

Stability vs Implicit Bias of Gradient Methods on Separable Data and Beyond

An influential line of recent work has focused on the generalization properties of unregularized gradient-based learning procedures applied to separable linear classification with exponentially-tailed loss functions. The ability of such…

Machine Learning · Computer Science 2022-06-24 Matan Schliserman, Tomer Koren

The Implicit Bias of Gradient Descent on Separable Data

We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The…

Machine Learning · Statistics 2024-10-29 Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro

Exponential Convergence of (Stochastic) Gradient Descent for Separable Logistic Regression

Gradient descent and stochastic gradient descent are central to modern machine learning, yet their behavior under large step sizes remains theoretically unclear. Recent work suggests that acceleration often arises near the edge of…

Machine Learning · Computer Science 2026-03-02 Sacchit Kale, Piyushi Manupriya, Pierre Marion, Francis Bach, Anant Raj

Gradient Descent with Provably Tuned Learning-rate Schedules

Gradient-based iterative optimization methods are the workhorse of modern machine learning. They crucially rely on careful tuning of parameters like learning rate and momentum. However, one typically sets them using heuristic approaches…

Machine Learning · Computer Science 2025-12-05 Dravyansh Sharma

Analysis of gradient descent methods with non-diminishing, bounded errors

The main aim of this paper is to provide an analysis of gradient descent (GD) algorithms with gradient errors that do not necessarily vanish asymptotically. In particular, sufficient conditions are presented for both stability (almost sure…

Systems and Control · Computer Science 2017-09-19 Arunselvan Ramaswamy, Shalabh Bhatnagar

On the Distributional Properties of Adaptive Gradients

Adaptive gradient methods have achieved remarkable success in training deep neural networks on a wide variety of tasks. However, not much is known about the mathematical and statistical properties of this family of methods. This work aims…

Machine Learning · Computer Science 2021-05-18 Zhang Zhiyi, Liu Ziyin

Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?

Many modern learning tasks involve fitting nonlinear models to data which are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Due to this overparameterization, the training…

Machine Learning · Computer Science 2018-12-27 Samet Oymak, Mahdi Soltanolkotabi

Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis

We analyze recurrent neural networks with diagonal hidden-to-hidden weight matrices, trained with gradient descent in the supervised learning setting, and prove that gradient descent can achieve optimality without massive…

Machine Learning · Computer Science 2024-10-11 Semih Cayci, Atilla Eryilmaz

Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network

Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks. Yet, existing convergence guarantees for adaptive gradient methods require either convexity or smoothness, and, in the smooth setting, only guarantee…

Machine Learning · Computer Science 2019-10-22 Xiaoxia Wu, Simon S. Du, Rachel Ward

Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search)

Adaptive gradient methods are typically used for training over-parameterized models. To better understand their behaviour, we study a simplistic setting -- smooth, convex losses with models over-parameterized enough to interpolate the data.…

Machine Learning · Computer Science 2021-02-22 Sharan Vaswani, Issam Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien

Deep learning: a statistical viewpoint

The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite…

Statistics Theory · Mathematics 2021-03-17 Peter L. Bartlett, Andrea Montanari, Alexander Rakhlin

When Does Stochastic Gradient Algorithm Work Well?

In this paper, we consider a general stochastic optimization problem which is often at the core of supervised learning, such as deep learning and linear classification. We consider a standard stochastic gradient descent (SGD) method with a…

Machine Learning · Statistics 2018-12-27 Lam M. Nguyen, Nam H. Nguyen, Dzung T. Phan, Jayant R. Kalagnanam, Katya Scheinberg

Characterizing the implicit bias via a primal-dual analysis

This paper shows that the implicit bias of gradient descent on linearly separable data is exactly characterized by the optimal solution of a dual optimization problem given by a smoothed margin, even for general losses. This is in contrast…

Machine Learning · Computer Science 2020-11-13 Ziwei Ji, Matus Telgarsky

Gradient descent inference in empirical risk minimization

Gradient descent is one of the most widely used iterative algorithms in modern statistical learning. However, its precise algorithmic dynamics in high-dimensional settings remain only partially understood, which has limited its broader…

Statistics Theory · Mathematics 2025-11-19 Qiyang Han, Xiaocong Xu

An Improved Analysis of Training Over-parameterized Deep Neural Networks

A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the…

Machine Learning · Computer Science 2019-06-12 Difan Zou, Quanquan Gu

Distribution of Classification Margins: Are All Data Equal?

Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes classification margin, which is equivalent to minimizing the norm of the weight matrices under margin…

Machine Learning · Computer Science 2021-07-22 Andrzej Banburski, Fernanda De La Torre, Nishka Pant, Ishana Shastri, Tomaso Poggio

On the Convergence of Gradient Descent for Large Learning Rates

A vast literature on convergence guarantees for gradient descent and derived methods exists at the moment. However, a simple practical situation remains unexplored: when a fixed step size is used, can we expect gradient descent to converge…

Machine Learning · Computer Science 2024-12-10 Alexandru Crăciun, Debarghya Ghoshdastidar

The generalization error of max-margin linear classifiers: Benign overfitting and high dimensional asymptotics in the overparametrized regime

Modern machine learning classifiers often exhibit vanishing classification error on the training set. They achieve this by learning nonlinear representations of the inputs that map the data into linearly separable classes. Motivated by…

Statistics Theory · Mathematics 2023-03-23 Andrea Montanari, Feng Ruan, Youngtak Sohn, Jun Yan