Related papers: Automatic Gradient Descent: Deep Learning without …

Learning by Turning: Neural Architecture Aware Optimisation

Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper…

Neural and Evolutionary Computing · Computer Science 2021-09-21 Yang Liu , Jeremy Bernstein , Markus Meister , Yisong Yue

Gradient Descent as Implicit EM in Distance-Based Neural Models

Neural networks trained with standard objectives exhibit behaviors characteristic of probabilistic inference: soft clustering, prototype specialization, and Bayesian uncertainty tracking. These phenomena appear across architectures -- in…

Machine Learning · Computer Science 2026-01-01 Alan Oursland

Learning to Learn without Gradient Descent by Gradient Descent

We learn recurrent neural network optimizers trained on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer in that they can be used to efficiently optimize a broad…

Machine Learning · Statistics 2017-06-13 Yutian Chen , Matthew W. Hoffman , Sergio Gomez Colmenarejo , Misha Denil , Timothy P. Lillicrap , Matt Botvinick , Nando de Freitas

A Framework for Overparameterized Learning

A candidate explanation of the good empirical performance of deep neural networks is the implicit regularization effect of first order optimization methods. Inspired by this, we prove a convergence theorem for nonconvex composite…

Machine Learning · Computer Science 2023-02-14 Dávid Terjék , Diego González-Sánchez

Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis

We analyze recurrent neural networks with diagonal hidden-to-hidden weight matrices, trained with gradient descent in the supervised learning setting, and prove that gradient descent can achieve optimality \emph{without} massive…

Machine Learning · Computer Science 2024-10-11 Semih Cayci , Atilla Eryilmaz

Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization

Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that compared with (stochastic) gradient descent, Adam can converge to a different solution with a…

Machine Learning · Computer Science 2021-08-26 Difan Zou , Yuan Cao , Yuanzhi Li , Quanquan Gu

Gradient Descent based Optimization Algorithms for Deep Learning Models Training

In this paper, we aim at providing an introduction to the gradient descent based optimization algorithms for learning deep neural network models. Deep learning models involving multiple nonlinear projection layers are very challenging to…

Machine Learning · Computer Science 2019-03-12 Jiawei Zhang

The duality structure gradient descent algorithm: analysis and applications to neural networks

The training of machine learning models is typically carried out using some form of gradient descent, often with great success. However, non-asymptotic analyses of first-order optimization algorithms typically employ a gradient smoothness…

Machine Learning · Computer Science 2024-06-18 Thomas Flynn

Theory of Deep Learning III: explaining the non-overfitting puzzle

A main puzzle of deep networks revolves around the absence of overfitting despite large overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamics…

Machine Learning · Computer Science 2018-01-17 Tomaso Poggio , Kenji Kawaguchi , Qianli Liao , Brando Miranda , Lorenzo Rosasco , Xavier Boix , Jack Hidary , Hrushikesh Mhaskar

Practical recommendations for gradient-based training of deep architectures

Learning algorithms related to artificial neural networks and in particular for Deep Learning may seem to involve many bells and whistles, called hyper-parameters. This chapter is meant as a practical guide with recommendations for some of…

Machine Learning · Computer Science 2012-09-18 Yoshua Bengio

Never Saddle for Reparameterized Steepest Descent as Mirror Flow

How does the choice of optimization algorithm shape a model's ability to learn features? To address this question for steepest descent methods --including sign descent, which is closely related to Adam --we introduce steepest mirror flows…

Machine Learning · Computer Science 2026-03-03 Tom Jacobs , Chao Zhou , Rebekka Burkholz

Gradient Descent Finds Global Minima of Deep Neural Networks

Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized…

Machine Learning · Computer Science 2019-05-30 Simon S. Du , Jason D. Lee , Haochuan Li , Liwei Wang , Xiyu Zhai

Deep Networks from the Principle of Rate Reduction

This work attempts to interpret modern deep (convolutional) networks from the principles of rate reduction and (shift) invariant classification. We show that the basic iterative gradient ascent scheme for optimizing the rate reduction of…

Machine Learning · Computer Science 2020-10-30 Kwan Ho Ryan Chan , Yaodong Yu , Chong You , Haozhi Qi , John Wright , Yi Ma

Random Sparse Lifts: Construction, Analysis and Convergence of finite sparse networks

We present a framework to define a large class of neural networks for which, by construction, training by gradient flow provably reaches arbitrarily low loss when the number of parameters grows. Distinct from the fixed-space global…

Optimization and Control · Mathematics 2025-01-13 David A. R. Robin , Kevin Scaman , Marc Lelarge

Optimization-Based Separations for Neural Networks

Depth separation results propose a possible theoretical explanation for the benefits of deep neural networks over shallower architectures, establishing that the former possess superior approximation capabilities. However, there are no known…

Machine Learning · Computer Science 2023-02-03 Itay Safran , Jason D. Lee

Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale

Neural network optimization remains one of the most consequential yet poorly understood challenges in modern AI research, where improvements in training algorithms can lead to enhanced feature learning in foundation models,…

Machine Learning · Computer Science 2025-12-23 Ansh Nagwekar

Gradient Descent: The Ultimate Optimizer

Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as its step size. Recent work has shown how the step size can itself be optimized alongside the model…

Machine Learning · Computer Science 2022-10-18 Kartik Chandra , Audrey Xie , Jonathan Ragan-Kelley , Erik Meijer

Meta Mirror Descent: Optimiser Learning for Fast Convergence

Optimisers are an essential component for training machine learning models, and their design influences learning speed and generalisation. Several studies have attempted to learn more effective gradient-descent optimisers via solving a…

Machine Learning · Computer Science 2022-03-08 Boyan Gao , Henry Gouk , Hae Beom Lee , Timothy M. Hospedales

Depth Without the Magic: Inductive Bias of Natural Gradient Descent

In gradient descent, changing how we parametrize the model can lead to drastically different optimization trajectories, giving rise to a surprising range of meaningful inductive biases: identifying sparse classifiers or reconstructing…

Machine Learning · Statistics 2021-11-24 Anna Kerekes , Anna Mészáros , Ferenc Huszár

A Comprehensive Study on Optimization Strategies for Gradient Descent In Deep Learning

One of the most important parts of Artificial Neural Networks is minimizing the loss functions which tells us how good or bad our model is. To minimize these losses we need to tune the weights and biases. Also to calculate the minimum value…

Machine Learning · Computer Science 2021-01-08 Kaustubh Yadav