Related papers: A block-random algorithm for learning on distribut…
When using stochastic gradient descent to solve large-scale machine learning problems, a common practice of data processing is to shuffle the training data, partition the data across multiple machines if needed, and then perform several…
Motivated by extreme multi-label classification applications, we consider training deep learning models over sparse data in multi-GPU servers. The variance in the number of non-zero features across training batches and the intrinsic GPU…
The widely-adopted practice is to train deep learning models with specialized hardware accelerators, e.g., GPUs or TPUs, due to their superior performance on linear algebra operations. However, this strategy does not employ effectively the…
Shuffling strategies for stochastic gradient descent (SGD), including incremental gradient, shuffle-once, and random reshuffling, are supported by rigorous convergence analyses for arbitrary within-epoch permutations. In particular, random…
Deep neural networks have been shown to achieve state-of-the-art performance in several machine learning tasks. Stochastic Gradient Descent (SGD) is the preferred optimization algorithm for training these networks and asynchronous SGD…
We propose Multi-Level Local SGD, a distributed gradient method for learning a smooth, non-convex objective in a heterogeneous multi-level network. Our network model consists of a set of disjoint sub-networks, with a single hub and multiple…
Stochastic gradient descent (SGD) is a widely adopted iterative method for optimizing differentiable objective functions. In this paper, we propose and discuss a novel approach to scale up SGD in applications involving non-convex functions…
Due to the massive size of the neural network models and training datasets used in machine learning today, it is imperative to distribute stochastic gradient descent (SGD) by splitting up tasks such as gradient evaluation across multiple…
In this paper, we propose a generic and simple strategy for utilizing stochastic gradient information in optimization. The technique essentially contains two consecutive steps in each iteration: 1) computing and normalizing each block…
Distributed machine learning has been widely studied in the literature to scale up machine learning model training in the presence of an ever-increasing amount of data. We study distributed machine learning from another perspective, where…
Traditional Differentially Private Stochastic Gradient Descent (DP-SGD) introduces statistical noise on top of gradients drawn from a Gaussian distribution to ensure privacy. This paper introduces the novel Differentially Private Block-wise…
Stochastic Gradient Descent (SGD) is a fundamental algorithm in machine learning, representing the optimization backbone for training several classic models, from regression to neural networks. Given the recent practical focus on…
With huge amounts of training data, deep learning has made great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed on…
In this paper, we consider a general stochastic optimization problem which is often at the core of supervised learning, such as deep learning and linear classification. We consider a standard stochastic gradient descent (SGD) method with a…
Stochastic Gradient Descent (SGD) is the most popular algorithm for training deep neural networks (DNNs). As larger networks and datasets cause longer training times, training on distributed systems is common and distributed SGD variants,…
When using large-batch training to speed up stochastic gradient descent, learning rates must adapt to new batch sizes in order to maximize speed-ups and preserve model quality. Re-tuning learning rates is resource intensive, while fixed…
One of the most common methods to train machine learning algorithms today is the stochastic gradient descent (SGD). In a distributed setting, SGD-based algorithms have been shown to converge theoretically under specific circumstances. A…
Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training. Synchronous Stochastic Gradient Descent (SSGD) 1 is the de facto DDL optimization method. Using a sufficiently large batch size is critical to…
Stochastic gradient descent (SGD) is a popular and efficient method with wide applications in training deep neural nets and other nonconvex models. While the behavior of SGD is well understood in the convex learning setting, the existing…
Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. Scaling up the training process of these models is crucial, but the most popular algorithm, Stochastic Gradient Descent…