Related papers: Deep Linear Network Training Dynamics from Random …

High-dimensional dynamics of generalization error in neural networks

We perform an average case analysis of the generalization dynamics of large neural networks trained using gradient descent. We study the practically-relevant "high-dimensional" regime where the number of free parameters in the network is on…

Machine Learning · Statistics 2017-10-11 Madhu S. Advani, Andrew M. Saxe

Convergence of Gradient Descent for Recurrent Neural Networks: A Nonasymptotic Analysis

We analyze recurrent neural networks with diagonal hidden-to-hidden weight matrices, trained with gradient descent in the supervised learning setting, and prove that gradient descent can achieve optimality without massive…

Machine Learning · Computer Science 2024-10-11 Semih Cayci, Atilla Eryilmaz

Convergence and Implicit Bias of Gradient Flow on Overparametrized Linear Networks

Neural networks trained via gradient descent with random initialization and without any regularization enjoy good generalization performance in practice despite being highly overparametrized. A promising direction to explain this phenomenon…

Machine Learning · Computer Science 2022-05-17 Hancheng Min, Salma Tarmoun, Rene Vidal, Enrique Mallada

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we…

Machine Learning · Statistics 2021-02-03 Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington

Overparameterization of deep ResNet: zero loss and mean-field analysis

Finding parameters in a deep neural network (NN) that fit training data is a nonconvex optimization problem, but a basic first-order optimization method (gradient descent) finds a global optimizer with perfect fit (zero-loss) in many…

Machine Learning · Computer Science 2025-03-07 Zhiyan Ding, Shi Chen, Qin Li, Stephen Wright

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep…

Neural and Evolutionary Computing · Computer Science 2014-02-20 Andrew M. Saxe, James L. McClelland, Surya Ganguli

A Comparative Analysis of the Optimization and Generalization Property of Two-layer Neural Network and Random Feature Models Under Gradient Descent Dynamics

A fairly comprehensive analysis is presented for the gradient descent dynamics for training two-layer neural network models in the situation when the parameters in both layers are updated. General initialization schemes as well as general…

Machine Learning · Computer Science 2020-02-27 Weinan E, Chao Ma, Lei Wu

A Polynomial-Based Approach for Architectural Design and Learning with Deep Neural Networks

In this effort we propose a novel approach for reconstructing multivariate functions from training data, by identifying both a suitable network architecture and an initialization using polynomial-based approximations. Training deep neural…

Machine Learning · Computer Science 2019-05-29 Joseph Daws, Clayton G. Webster

Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the…

Machine Learning · Statistics 2023-12-11 Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, Cengiz Pehlevan

High-Dimensional Analysis of Gradient Flow for Extensive-Width Quadratic Neural Networks

We study the high-dimensional training dynamics of a shallow neural network with quadratic activation in a teacher-student setup. We focus on the extensive-width regime, where the teacher and student network widths scale proportionally with…

Optimization and Control · Mathematics 2026-01-16 Simon Martin, Giulio Biroli, Francis Bach

Precise gradient descent training dynamics for finite-width multi-layer neural networks

In this paper, we provide the first precise distributional characterization of gradient descent iterates for general multi-layer neural networks under the canonical single-index regression model, in the 'finite-width proportional regime'…

Machine Learning · Computer Science 2025-05-09 Qiyang Han, Masaaki Imaizumi

Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit

To theoretically understand the behavior of trained deep neural networks, it is necessary to study the dynamics induced by gradient methods from a random initialization. However, the nonlinear and compositional structure of these models…

Machine Learning · Computer Science 2021-12-21 Karl Hajjar, Lénaïc Chizat, Christophe Giraud

An Improved Analysis of Training Over-parameterized Deep Neural Networks

A recent line of research has shown that gradient-based algorithms with random initialization can converge to the global minima of the training loss for over-parameterized (i.e., sufficiently wide) deep neural networks. However, the…

Machine Learning · Computer Science 2019-06-12 Difan Zou, Quanquan Gu

Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks

The skip-connections used in residual networks have become a standard architecture choice in deep learning due to the increased training stability and generalization performance with this architecture, although there has been limited…

Machine Learning · Computer Science 2019-10-08 Spencer Frei, Yuan Cao, Quanquan Gu

Infinite-width limit of deep linear neural networks

This paper studies the infinite-width limit of deep linear neural networks initialized with random parameters. We obtain that, when the number of neurons diverges, the training dynamics converge (in a precise sense) to the dynamics obtained…

Machine Learning · Computer Science 2022-12-01 Lénaïc Chizat, Maria Colombo, Xavier Fernández-Real, Alessio Figalli

Implicit Regularization of Discrete Gradient Dynamics in Linear Neural Networks

When optimizing over-parameterized models, such as deep neural networks, a large set of parameters can achieve zero training error. In such cases, the choice of the optimization algorithm and its respective hyper-parameters introduces…

Machine Learning · Computer Science 2019-12-06 Gauthier Gidel, Francis Bach, Simon Lacoste-Julien

Effects of Depth, Width, and Initialization: A Convergence Analysis of Layer-wise Training for Deep Linear Neural Networks

Deep neural networks have been used in various machine learning applications and achieved tremendous empirical successes. However, training deep neural networks is a challenging task. Many alternatives have been proposed in place of…

Machine Learning · Computer Science 2020-09-09 Yeonjong Shin

An analytic theory of generalization dynamics and transfer learning in deep linear networks

Much attention has been devoted recently to the generalization puzzle in deep learning: large, deep networks can generalize well, but existing theories bounding generalization error are exceedingly loose, and thus cannot explain this…

Machine Learning · Statistics 2019-01-08 Andrew K. Lampinen, Surya Ganguli

Theory of Deep Learning III: explaining the non-overfitting puzzle

A main puzzle of deep networks revolves around the absence of overfitting despite large overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamics…

Machine Learning · Computer Science 2018-01-17 Tomaso Poggio, Kenji Kawaguchi, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Xavier Boix, Jack Hidary, Hrushikesh Mhaskar

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

One of the mysteries in the success of neural networks is randomly initialized first order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies…

Machine Learning · Computer Science 2019-02-06 Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh