Related papers: Distributed stochastic optimization for deep learn…

Deep learning with Elastic Averaging SGD

We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among…

Machine Learning · Computer Science 2015-10-27 Sixin Zhang , Anna Choromanska , Yann LeCun

Staleness-aware Async-SGD for Distributed Deep Learning

Deep neural networks have been shown to achieve state-of-the-art performance in several machine learning tasks. Stochastic Gradient Descent (SGD) is the preferred optimization algorithm for training these networks and asynchronous SGD…

Machine Learning · Computer Science 2016-04-06 Wei Zhang , Suyog Gupta , Xiangru Lian , Ji Liu

A Dynamic Weighting Strategy to Mitigate Worker Node Failure in Distributed Deep Learning

The increasing complexity of deep learning models and the demand for processing vast amounts of data make the utilization of large-scale distributed systems for efficient training essential. These systems, however, face significant…

Machine Learning · Computer Science 2024-09-17 Yuesheng Xu , Arielle Carr

Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources

With an increasing demand for training powers for deep learning algorithms and the rapid growth of computation resources in data centers, it is desirable to dynamically schedule different distributed deep learning tasks to maximize resource…

Machine Learning · Computer Science 2019-05-03 Haibin Lin , Hang Zhang , Yifei Ma , Tong He , Zhi Zhang , Sheng Zha , Mu Li

Distributed Stochastic Optimization via Adaptive SGD

Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. Scaling up the training process of these models is crucial, but the most popular algorithm, Stochastic Gradient Descent…

Machine Learning · Statistics 2018-10-30 Ashok Cutkosky , Robert Busa-Fekete

Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees

Asynchronous distributed algorithms are a popular way to reduce synchronization costs in large-scale optimization, and in particular for neural network training. However, for nonsmooth and nonconvex objectives, few convergence guarantees…

Optimization and Control · Mathematics 2020-07-14 Vyacheslav Kungurtsev , Malcolm Egan , Bapi Chatterjee , Dan Alistarh

Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

The increasing size of deep learning models has made distributed training across multiple devices essential. However, current methods such as distributed data-parallel training suffer from large communication and synchronization overheads…

Machine Learning · Computer Science 2025-02-10 Cabrel Teguemne Fokam , Khaleelulla Khan Nazeer , Lukas König , David Kappel , Anand Subramoney

OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training

The training of modern deep learning neural network calls for large amounts of computation, which is often provided by GPUs or other specific accelerators. To scale out to achieve faster training speed, two update algorithms are mainly…

Machine Learning · Computer Science 2020-05-15 Yemao Xu , Dezun Dong , Weixia Xu , Xiangke Liao

Distributed stochastic optimization with large delays

One of the most widely used methods for solving large-scale stochastic optimization problems is distributed asynchronous stochastic gradient descent (DASGD), a family of algorithms that result from parallelizing stochastic gradient descent…

Optimization and Control · Mathematics 2021-07-08 Zhengyuan Zhou , Panayotis Mertikopoulos , Nicholas Bambos , Peter W. Glynn , Yinyu Ye

Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models: Extension

We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by…

Machine Learning · Computer Science 2022-04-29 Yunfei Teng , Wenbo Gao , Francois Chalus , Anna Choromanska , Donald Goldfarb , Adrian Weller

How to scale distributed deep learning?

Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in automatic driver assistance systems (ADAS).…

Machine Learning · Computer Science 2016-11-15 Peter H. Jin , Qiaochu Yuan , Forrest Iandola , Kurt Keutzer

Asynchronous Distributed Semi-Stochastic Gradient Optimization

With the recent proliferation of large-scale learning problems,there have been a lot of interest on distributed machine learning algorithms, particularly those that are based on stochastic gradient descent (SGD) and its variants. However,…

Machine Learning · Computer Science 2015-12-07 Ruiliang Zhang , Shuai Zheng , James T. Kwok

Asynchronous Decentralized Distributed Training of Acoustic Models

Large-scale distributed training of deep acoustic models plays an important role in today's high-performance automatic speech recognition (ASR). In this paper we investigate a variety of asynchronous decentralized distributed training…

Computation and Language · Computer Science 2021-10-22 Xiaodong Cui , Wei Zhang , Abdullah Kayi , Mingrui Liu , Ulrich Finkler , Brian Kingsbury , George Saon , David Kung

A continuous-time analysis of distributed stochastic gradient

We analyze the effect of synchronization on distributed stochastic gradient algorithms. By exploiting an analogy with dynamical models of biological quorum sensing - where synchronization between agents is induced through communication with…

Optimization and Control · Mathematics 2020-12-18 Nicholas M. Boffi , Jean-Jacques E. Slotine

Scaling up Stochastic Gradient Descent for Non-convex Optimisation

Stochastic gradient descent (SGD) is a widely adopted iterative method for optimizing differentiable objective functions. In this paper, we propose and discuss a novel approach to scale up SGD in applications involving non-convex functions…

Machine Learning · Statistics 2022-10-07 Saad Mohamad , Hamad Alamri , Abdelhamid Bouchachia

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

Asynchronous stochastic gradient descent (ASGD) is a standard way to exploit heterogeneous compute resources in distributed learning: instead of forcing fast workers to wait for slow ones, the server updates the model whenever a gradient…

Machine Learning · Computer Science 2026-05-14 Ammar Mahran , Artavazd Maranjyan , Peter Richtárik

Reinforced stochastic gradient descent for deep neural network learning

Stochastic gradient descent (SGD) is a standard optimization method to minimize a training error with respect to network parameters in modern neural network learning. However, it typically suffers from proliferation of saddle points in the…

Machine Learning · Computer Science 2017-11-23 Haiping Huang , Taro Toyoizumi

The Optimality of (Accelerated) SGD for High-Dimensional Quadratic Optimization

Stochastic gradient descent (SGD) is a widely used algorithm in machine learning, particularly for neural network training. Recent studies on SGD for canonical quadratic optimization or linear regression show it attains well generalization…

Machine Learning · Computer Science 2024-09-17 Haihan Zhang , Yuanshi Liu , Qianwen Chen , Cong Fang

Online Learning to Sample

Stochastic Gradient Descent (SGD) is one of the most widely used techniques for online optimization in machine learning. In this work, we accelerate SGD by adaptively learning how to sample the most useful training examples at each time…

Machine Learning · Computer Science 2016-03-16 Guillaume Bouchard , Théo Trouillon , Julien Perez , Adrien Gaidon

A Bias-Correction Decentralized Stochastic Gradient Algorithm with Momentum Acceleration

Distributed stochastic optimization algorithms can simultaneously process large-scale datasets, significantly accelerating model training. However, their effectiveness is often hindered by the sparsity of distributed networks and data…

Machine Learning · Computer Science 2025-02-14 Yuchen Hu , Xi Chen , Weidong Liu , Xiaojun Mao