Related papers: Nesterov Method for Asynchronous Pipeline Parallel…
Pipeline parallelism (PP) when training neural networks enables larger models to be partitioned spatially, leading to both lower network communication and overall higher hardware utilization. Unfortunately, to preserve the statistical…
Nesterov's accelerated gradient method (NAG) is widely used in problems with machine learning background including deep learning, and is corresponding to a continuous-time differential equation. From this connection, the property of the…
Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their…
Momentum methods, including heavy-ball~(HB) and Nesterov's accelerated gradient~(NAG), are widely used in training neural networks for their fast convergence. However, there is a lack of theoretical guarantees for their convergence and…
In this paper, we propose Nesterov Accelerated Shuffling Gradient (NASG), a new algorithm for the convex finite-sum minimization problems. Our method integrates the traditional Nesterov's acceleration momentum with different shuffling…
Momentum methods, such as heavy ball method~(HB) and Nesterov's accelerated gradient method~(NAG), have been widely used in training neural networks by incorporating the history of gradients into the current updating process. In practice,…
We investigate the theoretical limits of pipeline parallel learning of deep learning architectures, a distributed setup in which the computation is distributed per layer instead of per example. For smooth convex and non-convex objective…
We study the convergence rate of first-order methods for rectangular matrix factorization, which is a canonical nonconvex optimization problem. Specifically, given a rank-$r$ matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, we prove that…
The Nesterov accelerated gradient (NAG) method is an important extrapolation-based numerical algorithm that accelerates the convergence of the gradient descent method in convex optimization. When dealing with an objective function that is…
To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…
We develop an adaptive Nesterov accelerated proximal gradient (adaNAPG) algorithm for stochastic composite optimization problems, boosting the Nesterov accelerated proximal gradient (NAPG) algorithm through the integration of an adaptive…
Pipeline parallelism is essential for large-scale model training, but existing asynchronous approaches often degrade convergence due to parameter mismatch between forward and backward passes. We propose Asynchronous Multi-Directional…
The proximal point method (PPM) is a fundamental method in optimization that is often used as a building block for designing optimization algorithms. In this work, we use the PPM method to provide conceptually simple derivations along with…
It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…
We present a totally asynchronous algorithm for convex optimization that is based on a novel generalization of Nesterov's accelerated gradient method. This algorithm is developed for fast convergence under "total asynchrony," i.e., allowing…
Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into…
Recent studies incorporate Nesterov's accelerated gradient method for the acceleration of gradient based training. The Nesterov's Accelerated Quasi-Newton (NAQ) method has shown to drastically improve the convergence speed compared to the…
Classical machine learning models such as deep neural networks are usually trained by using Stochastic Gradient Descent-based (SGD) algorithms. The classical SGD can be interpreted as a discretization of the stochastic gradient flow. In…
Due to its simplicity and efficiency, the first-order gradient method has been extensively employed in training neural networks. Although the optimization problem of the neural network is non-convex, recent research has proved that the…
A novel dynamical inertial Newton system, which is called Hessian-driven Nesterov accelerated gradient (H-NAG) flow is proposed. Convergence of the continuous trajectory are established via tailored Lyapunov function, and new first-order…