Related papers: Implicit Gradient Alignment in Distributed and Fed…

200 papers

For infinitesimal learning rates, stochastic gradient descent (SGD) follows the path of gradient flow on the full batch loss function. However, moderately large learning rates can achieve higher test accuracies, and this generalization…

Machine Learning · Computer Science · 2021-01-29 · Samuel L. Smith, Benoit Dherin, David G. T. Barrett, Soham De

It has been experimentally observed that distributed implementations of mini-batch stochastic gradient descent (SGD) algorithms exhibit speedup saturation and decaying generalization ability beyond a particular batch-size. In this work, we…

Machine Learning · Computer Science · 2018-01-09 · Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, Peter Bartlett

Works on implicit regularization have studied gradient trajectories during the optimization process to explain why deep networks favor certain kinds of solutions over others. In deep linear networks, it has been shown that gradient descent…

Machine Learning · Computer Science · 2023-06-02 · Dan Zhao

We introduce Gradient Agreement Filtering (GAF) to improve on gradient averaging in distributed deep learning optimization. Traditional distributed data-parallel stochastic gradient descent involves averaging gradients of microbatches to…

Machine Learning · Computer Science · 2024-12-31 · Francois Chaubard, Duncan Eddy, Mykel J. Kochenderfer

A number of competing hypotheses have been proposed to explain why small-batch Stochastic Gradient Descent (SGD) leads to improved generalization over the full-batch regime, with recent work crediting the implicit regularization of various…

Machine Learning · Computer Science · 2022-11-30 · Zachary Novack, Simran Kaur, Tanya Marwah, Saurabh Garg, Zachary C. Lipton

Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient…

Machine Learning · Computer Science · 2022-07-20 · David G. T. Barrett, Benoit Dherin

Inspired by the remarkable success of large neural networks, there has been significant interest in understanding the generalization performance of over-parameterized models. Substantial efforts have been invested in characterizing how…

Machine Learning · Computer Science · 2024-01-12 · Haoyuan Sun, Khashayar Gatmiry, Kwangjun Ahn, Navid Azizan

Deep neural networks with remarkably strong generalization performance are usually over-parameterized. Although practitioners use explicit regularization strategies to avoid over-fitting, the impacts are often small. Some…

Computation and Language · Computer Science · 2018-11-05 · Deren Lei, Zichen Sun, Yijun Xiao, William Yang Wang

Deep neural networks are a promising approach towards multi-task learning because of their capability to leverage knowledge across domains and learn general purpose representations. Nevertheless, they can fail to live up to these promises…

Machine Learning · Computer Science · 2019-12-17 · Mihai Suteu, Yike Guo

Gradient regularization (GR) is a method that penalizes the gradient norm of the training loss during training. While some studies have reported that GR can improve generalization performance, little attention has been paid to it from the…

Machine Learning · Computer Science · 2023-02-06 · Ryo Karakida, Tomoumi Takase, Tomohiro Hayase, Kazuki Osawa

Deep learning systems are known to exhibit implicit regularization (alt. implicit bias), favoring simple solutions instead of merely minimizing the loss function. In some cases, we can analytically derive the implicit regularization --…

Machine Learning · Statistics · 2026-05-08 · Joseph H. Rudoler, Kevin Tan, Giles Hooker, Konrad P. Kording

A widely believed explanation for the remarkable generalization capacities of overparameterized neural networks is that the optimization algorithms used for training induce an implicit bias towards benign solutions. To grasp this…

Machine Learning · Computer Science · 2025-12-19 · Maria Matveev, Vit Fojtik, Hung-Hsu Chou, Gitta Kutyniok, Johannes Maly

Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency…

Machine Learning · Statistics · 2022-06-16 · Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington

Large-batch SGD is important for scaling training of deep neural networks. However, without fine-tuning hyperparameter schedules, the generalization of the model may be hampered. We propose to use batch augmentation: replicating instances…

Machine Learning · Computer Science · 2019-01-29 · Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, Daniel Soudry

The stochastic gradient descent (SGD) method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, say $32$-$512$ data points, is…

Machine Learning · Computer Science · 2017-02-13 · Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang

Data imbalance is a common problem in machine learning that can have a critical effect on the performance of a model. Various solutions exist but their impact on the convergence of the learning dynamics is not understood. Here, we elucidate…

Machine Learning · Statistics · 2024-02-20 · Emanuele Francazi, Marco Baity-Jesi, Aurelien Lucchi

While nowadays most gradient-based optimization methods focus on exploring the high-dimensional geometric features, the random error accumulated in a stochastic version of any algorithm implementation has not been stressed yet. In this…

Machine Learning · Computer Science · 2020-08-14 · Tong Yang, Long Sha, Pengyu Hong

In federated distributed learning, the goal is to optimize a global training objective defined over distributed devices, where the data shard at each device is sampled from a possibly different distribution (a.k.a., heterogeneous or non…

Machine Learning · Computer Science · 2019-12-10 · Farzin Haddadpour, Mehrdad Mahdavi

We study the implicit regularization imposed by gradient descent for learning multi-layer homogeneous functions including feed-forward fully connected and convolutional deep neural networks with linear, ReLU or Leaky ReLU activation. We…

Machine Learning · Computer Science · 2018-11-01 · Simon S. Du, Wei Hu, Jason D. Lee

Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low "complexity." We study the implicit…

Machine Learning · Computer Science · 2019-10-29 · Sanjeev Arora, Nadav Cohen, Wei Hu, Yuping Luo