Related papers: Gradient Methods Never Overfit On Separable Data
We study the generalization properties of unregularized gradient methods applied to separable linear classification -- a setting that has received considerable attention since the pioneering work of Soudry et al. (2018). We establish tight…
We provide a detailed study on the implicit bias of gradient descent when optimizing loss functions with strictly monotone tails, such as the logistic loss, over separable datasets. We look at two basic questions: (a) what are the…
An influential line of recent work has focused on the generalization properties of unregularized gradient-based learning procedures applied to separable linear classification with exponentially-tailed loss functions. The ability of such…
We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The…
Gradient descent and stochastic gradient descent are central to modern machine learning, yet their behavior under large step sizes remains theoretically unclear. Recent work suggests that acceleration often arises near the edge of…
Gradient-based iterative optimization methods are the workhorse of modern machine learning. They crucially rely on careful tuning of parameters like learning rate and momentum. However, one typically sets them using heuristic approaches…
The main aim of this paper is to provide an analysis of gradient descent (GD) algorithms with gradient errors that do not necessarily vanish, asymptotically. In particular, sufficient conditions are presented for both stability (almost sure…
Adaptive gradient methods have achieved remarkable success in training deep neural networks on a wide variety of tasks. However, not much is known about the mathematical and statistical properties of this family of methods. This work aims…
Many modern learning tasks involve fitting nonlinear models to data which are trained in an overparameterized regime where the parameters of the model exceed the size of the training dataset. Due to this overparameterization, the training…
We analyze recurrent neural networks with diagonal hidden-to-hidden weight matrices, trained with gradient descent in the supervised learning setting, and prove that gradient descent can achieve optimality \emph{without} massive…
Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks. Yet, existing convergence guarantees for adaptive gradient methods require either convexity or smoothness, and, in the smooth setting, only guarantee…
Adaptive gradient methods are typically used for training over-parameterized models. To better understand their behaviour, we study a simplistic setting -- smooth, convex losses with models over-parameterized enough to interpolate the data.…
The remarkable practical success of deep learning has revealed some major surprises from a theoretical perspective. In particular, simple gradient methods easily find near-optimal solutions to non-convex optimization problems, and despite…
In this paper, we consider a general stochastic optimization problem which is often at the core of supervised learning, such as deep learning and linear classification. We consider a standard stochastic gradient descent (SGD) method with a…
This paper shows that the implicit bias of gradient descent on linearly separable data is exactly characterized by the optimal solution of a dual optimization problem given by a smoothed margin, even for general losses. This is in contrast…
Gradient descent is one of the most widely used iterative algorithms in modern statistical learning. However, its precise algorithmic dynamics in high-dimensional settings remain only partially understood, which has limited its broader…
A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the…
Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes classification margin, which is equivalent to minimizing the norm of the weight matrices under margin…
A vast literature on convergence guarantees for gradient descent and derived methods exists at the moment. However, a simple practical situation remains unexplored: when a fixed step size is used, can we expect gradient descent to converge…
Modern machine learning classifiers often exhibit vanishing classification error on the training set. They achieve this by learning nonlinear representations of the inputs that maps the data into linearly separable classes. Motivated by…