Related papers: Gradient Descent Quantizes ReLU Network Features

A global convergence theory for deep ReLU implicit networks via over-parameterization

Implicit deep learning has received increasing attention recently due to the fact that it generalizes the recursive prediction rules of many commonly used neural network architectures. Its prediction rule is provided implicitly based on the…

Machine Learning · Computer Science 2022-02-21 Tianxiang Gao, Hailiang Liu, Jia Liu, Hridesh Rajan, Hongyang Gao

Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks

Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. Very…

Machine Learning · Computer Science 2019-11-28 Yuan Cao, Quanquan Gu

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for…

Machine Learning · Computer Science 2018-12-31 Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu

On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective

We consider training over-parameterized two-layer neural networks with Rectified Linear Unit (ReLU) using gradient descent (GD) method. Inspired by a recent line of work, we study the evolutions of network prediction errors across GD…

Machine Learning · Computer Science 2019-09-04 Lili Su, Pengkun Yang

Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data

Neural networks have many successful applications, while much less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class…

Machine Learning · Computer Science 2019-08-02 Yuanzhi Li, Yingyu Liang

Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks

Understanding the properties of neural networks trained via stochastic gradient descent (SGD) is at the heart of the theory of deep learning. In this work, we take a mean-field view, and consider a two-layer ReLU network trained via SGD for…

Machine Learning · Computer Science 2022-05-02 Alexander Shevchenko, Vyacheslav Kungurtsev, Marco Mondelli

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

One of the mysteries in the success of neural networks is randomly initialized first order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies…

Machine Learning · Computer Science 2019-02-06 Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh

Theoretical Issues in Deep Networks: Approximation, Optimization and Generalization

While deep learning is successful in a number of applications, it is not yet well understood theoretically. A satisfactory theoretical characterization of deep learning however, is beginning to emerge. It covers the following questions: 1)…

Machine Learning · Computer Science 2019-08-27 Tomaso Poggio, Andrzej Banburski, Qianli Liao

How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?

Overparameterized ML models, including neural networks, typically induce underdetermined training objectives with multiple global minima. The implicit bias refers to the limiting global minimum that is attained by a common optimization…

Machine Learning · Statistics 2026-03-06 Kuo-Wei Lai, Guanghui Wang, Molei Tao, Vidya Muthukumar

The Dynamics of Gradient Descent for Overparametrized Neural Networks

We consider the dynamics of gradient descent (GD) in overparameterized single hidden layer neural networks with a squared loss function. Recently, it has been shown that, under some conditions, the parameter values obtained using GD achieve…

Machine Learning · Computer Science 2021-05-17 Siddhartha Satpathi, R Srikant

A Convergence Theory for Deep Learning via Over-Parameterization

Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural…

Machine Learning · Computer Science 2019-06-18 Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song

On the optimization and generalization of overparameterized implicit neural networks

Implicit neural networks have become increasingly attractive in the machine learning community since they can achieve competitive performance but use much less computational resources. Recently, a line of theoretical works established the…

Machine Learning · Computer Science 2022-10-03 Tianxiang Gao, Hongyang Gao

Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

We study the training and generalization of deep neural networks (DNNs) in the over-parameterized regime, where the network width (i.e., number of hidden nodes per layer) is much larger than the number of training data points. We show that,…

Machine Learning · Computer Science 2019-11-13 Yuan Cao, Quanquan Gu

Regularization Matters: A Nonparametric Perspective on Overparametrized Neural Network

Overparametrized neural networks trained by gradient descent (GD) can provably overfit any training data. However, the generalization guarantee may not hold for noisy data. From a nonparametric perspective, this paper studies how well…

Machine Learning · Statistics 2021-09-28 Tianyang Hu, Wenjia Wang, Cong Lin, Guang Cheng

Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets

Overparameterization in deep learning typically refers to settings where a trained neural network (NN) has representational capacity to fit the training data in many ways, some of which generalize well, while others do not. In the case of…

Machine Learning · Computer Science 2023-03-24 Edo Cohen-Karlik, Itamar Menuhin-Gruman, Raja Giryes, Nadav Cohen, Amir Globerson

Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias

The generalization mystery of overparametrized deep nets has motivated efforts to understand how gradient descent (GD) converges to low-loss solutions that generalize well. Real-life neural networks are initialized from small random values…

Machine Learning · Computer Science 2021-11-10 Kaifeng Lyu, Zhiyuan Li, Runzhe Wang, Sanjeev Arora

Gradient Descent Optimizes Infinite-Depth ReLU Implicit Networks with Linear Widths

Implicit deep learning has recently become popular in the machine learning community since these implicit models can achieve competitive performance with state-of-the-art deep networks while using significantly less memory and computational…

Machine Learning · Computer Science 2022-05-17 Tianxiang Gao, Hongyang Gao

PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized Deep Neural Networks

Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness. The optimization objective driving weight decay is a sum of losses plus a term proportional…

Machine Learning · Computer Science 2023-07-07 Liu Yang, Jifan Zhang, Joseph Shenouda, Dimitris Papailiopoulos, Kangwook Lee, Robert D. Nowak

Towards Better Generalization: Weight Decay Induces Low-rank Bias for Neural Networks

We study the implicit bias towards low-rank weight matrices when training neural networks (NN) with Weight Decay (WD). We prove that when a ReLU NN is sufficiently trained with Stochastic Gradient Descent (SGD) and WD, its weight matrix is…

Machine Learning · Computer Science 2024-10-04 Ke Chen, Chugang Yi, Haizhao Yang

Bad Global Minima Exist and SGD Can Reach Them

Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD). The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the…

Machine Learning · Computer Science 2021-02-24 Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas