Related papers: AdaScale SGD: A User-Friendly Algorithm for Distri…

200 papers

The choice of batch sizes in minibatch stochastic gradient optimizers is critical in large-scale model training for both optimization and generalization performance. Although large-batch training is arguably the dominant training paradigm…

Machine Learning · Computer Science 2024-05-29 Tim Tsz-Kit Lau, Han Liu, Mladen Kolar

The goal of this paper is to accelerate the training of machine learning models, a critical challenge since the training of large-scale deep neural models can be computationally expensive. Stochastic gradient descent (SGD) and its variants…

Machine Learning · Computer Science 2025-09-22 Yuen Chen, Yian Wang, Hari Sundaram

Stochastic gradient descent (SGD) and its variants, including some accelerated variants, have become popular for training in machine learning. However, in all existing SGD and its variants, the sample size in each iteration (epoch) of…

Machine Learning · Statistics 2019-09-18 Shen-Yi Zhao, Hao Gao, Wu-Jun Li

The choice of step-size used in Stochastic Gradient Descent (SGD) optimization is empirically selected in most training procedures. Moreover, the use of scheduled learning techniques such as Step-Decaying, Cyclical-Learning, and Warmup to…

Machine Learning · Computer Science 2020-06-12 Mahdi S. Hosseini, Konstantinos N. Plataniotis

Stochastic gradient descent (SGD) is an inherently sequential training algorithm: computing the gradient at batch $i$ depends on the model parameters learned from batch $i-1$. Prior approaches that break this dependence do not honor them…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-05 Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Olli Saarikivi, Tianju Xu, Vadim Eksarevskiy, Jaliya Ekanayake, Emad Barsoum

This paper presents a novel adaptation of the Stochastic Gradient Descent (SGD), termed AdaBatchGrad. This modification seamlessly integrates an adaptive step size with an adjustable batch size. An increase in batch size and a decrease in…

Machine Learning · Computer Science 2024-02-09 Petr Ostroukhov, Aigerim Zhumabayeva, Chulu Xiang, Alexander Gasnikov, Martin Takáč, Dmitry Kamzolov

Training deep neural networks with Stochastic Gradient Descent, or its variants, requires careful choice of both learning rate and batch size. While smaller batch sizes generally converge in fewer training epochs, larger batch sizes offer…

Machine Learning · Computer Science 2018-02-15 Aditya Devarakonda, Maxim Naumov, Michael Garland

Classical stochastic gradient methods for optimization rely on noisy gradient approximations that become progressively less accurate as iterates approach a solution. The large noise and small signal in the resulting gradients make it…

Machine Learning · Computer Science 2017-04-10 Soham De, Abhay Yadav, David Jacobs, Tom Goldstein

Stochastic gradient descent (SGD) is the main approach for training deep networks: it moves towards the optimum of the cost function by iteratively updating the parameters of a model in the direction of the gradient of the loss evaluated on…

Machine Learning · Computer Science 2021-03-30 Loris Nanni, Gianluca Maguolo, Alessandra Lumini

Large-scale distributed training of deep acoustic models plays an important role in today's high-performance automatic speech recognition (ASR). In this paper we investigate a variety of asynchronous decentralized distributed training…

Computation and Language · Computer Science 2021-10-22 Xiaodong Cui, Wei Zhang, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung

The most straightforward method to accelerate Stochastic Gradient Descent (SGD) computation is to distribute the randomly selected batch of inputs over multiple processors. To keep the distributed processors fully utilized requires…

Machine Learning · Computer Science 2020-01-06 Zhewei Yao, Amir Gholami, Daiyaan Arfeen, Richard Liaw, Joseph Gonzalez, Kurt Keutzer, Michael Mahoney

Deep learning networks are typically trained by Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock faced when…

Machine Learning · Computer Science 2020-06-11 Tao Lin, Lingjing Kong, Sebastian U. Stich, Martin Jaggi

Increasing the batch size during training (a "batch ramp") is a promising strategy to accelerate large language model pretraining. While for SGD, doubling the batch size can be equivalent to halving the learning rate, the optimal…

Machine Learning · Computer Science 2025-10-17 Alexandru Meterez, Depen Morwani, Jingfeng Wu, Costin-Andrei Oncescu, Cengiz Pehlevan, Sham Kakade

Stochastic Gradient Descent (SGD) and its variants are almost universally used to train neural networks and to fit a variety of other parametric models. An important hyperparameter in this context is the batch size, which determines how…

Optimization and Control · Mathematics 2023-12-05 Stefan Perko

Adaptive optimization methods such as AdaGrad, RMSprop and Adam have been proposed to achieve a rapid training process with an element-wise scaling term on learning rates. Though prevailing, they are observed to generalize poorly compared…

Machine Learning · Computer Science 2019-04-22 Liangchen Luo, Yuanhao Xiong, Yan Liu, Xu Sun

Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes all needs for tuning, while automatically reducing learning rates over time on…

Machine Learning · Computer Science 2013-03-28 Tom Schaul, Yann LeCun

Stochastic gradient descent (SGD) is a widely adopted iterative method for optimizing differentiable objective functions. In this paper, we propose and discuss a novel approach to scale up SGD in applications involving non-convex functions…

Machine Learning · Statistics 2022-10-07 Saad Mohamad, Hamad Alamri, Abdelhamid Bouchachia

Asynchronous stochastic gradient descent (ASGD) is a standard way to exploit heterogeneous compute resources in distributed learning: instead of forcing fast workers to wait for slow ones, the server updates the model whenever a gradient…

Machine Learning · Computer Science 2026-05-14 Ammar Mahran, Artavazd Maranjyan, Peter Richtárik

Mini-batch stochastic gradient descent (SGD) and variants thereof approximate the objective function's gradient with a small number of training examples, aka the batch size. Small batch sizes require little computation for each model update…

Machine Learning · Computer Science 2023-09-28 Scott Sievert, Shrey Shah

We study a new aggregation operator for gradients coming from a mini-batch for stochastic gradient (SG) methods that allows a significant speed-up in the case of sparse optimization problems. We call this method AdaBatch and it only…

Machine Learning · Computer Science 2017-11-07 Alexandre Défossez, Francis Bach