Related papers: Nesterov Method for Asynchronous Pipeline Parallel…

200 papers

Pipeline parallelism (PP) when training neural networks enables larger models to be partitioned spatially, leading to both lower network communication and overall higher hardware utilization. Unfortunately, to preserve the statistical…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-11 Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher R. Aberger, Christopher De Sa

Nesterov's accelerated gradient method (NAG) is widely used in machine learning problems, including deep learning, and corresponds to a continuous-time differential equation. From this connection, the property of the…

Optimization and Control · Mathematics 2022-04-05 Yasong Feng, Weiguo Gao

Data and pipeline parallelism are key strategies for scaling neural network training across distributed devices, but their high communication cost necessitates co-located computing clusters with fast interconnects, limiting their…

Momentum methods, including heavy-ball (HB) and Nesterov's accelerated gradient (NAG), are widely used in training neural networks for their fast convergence. However, there is a lack of theoretical guarantees for their convergence and…

Machine Learning · Computer Science 2022-04-19 Xin Liu, Wei Tao, Zhisong Pan

In this paper, we propose Nesterov Accelerated Shuffling Gradient (NASG), a new algorithm for the convex finite-sum minimization problems. Our method integrates the traditional Nesterov's acceleration momentum with different shuffling…

Optimization and Control · Mathematics 2022-06-14 Trang H. Tran, Katya Scheinberg, Lam M. Nguyen

Momentum methods, such as the heavy ball method (HB) and Nesterov's accelerated gradient method (NAG), have been widely used in training neural networks by incorporating the history of gradients into the current updating process. In practice,…

Machine Learning · Computer Science 2022-04-19 Xin Liu, Zhisong Pan, Wei Tao

We investigate the theoretical limits of pipeline parallel learning of deep learning architectures, a distributed setup in which the computation is distributed per layer instead of per example. For smooth convex and non-convex objective…

Machine Learning · Statistics 2019-10-14 Igor Colin, Ludovic Dos Santos, Kevin Scaman

We study the convergence rate of first-order methods for rectangular matrix factorization, which is a canonical nonconvex optimization problem. Specifically, given a rank-$r$ matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$, we prove that…

Machine Learning · Computer Science 2024-12-03 Zhenghao Xu, Yuqing Wang, Tuo Zhao, Rachel Ward, Molei Tao

The Nesterov accelerated gradient (NAG) method is an important extrapolation-based numerical algorithm that accelerates the convergence of the gradient descent method in convex optimization. When dealing with an objective function that is…

Optimization and Control · Mathematics 2025-05-28 Chenglong Bao, Liang Chen, Jiahong Li

To train modern large DNN models, pipeline parallelism has recently emerged, which distributes the model across GPUs and enables different devices to process different microbatches in pipeline. Earlier pipeline designs allow multiple…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-23 Ziyue Luo, Xiaodong Yi, Guoping Long, Shiqing Fan, Chuan Wu, Jun Yang, Wei Lin

We develop an adaptive Nesterov accelerated proximal gradient (adaNAPG) algorithm for stochastic composite optimization problems, boosting the Nesterov accelerated proximal gradient (NAPG) algorithm through the integration of an adaptive…

Optimization and Control · Mathematics 2025-07-25 Dongxuan Zhu, Weihuan Huang, Caihua Chen

Pipeline parallelism is essential for large-scale model training, but existing asynchronous approaches often degrade convergence due to parameter mismatch between forward and backward passes. We propose Asynchronous Multi-Directional…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Ling Chen, Houming Wu, Wenjie Yu

The proximal point method (PPM) is a fundamental method in optimization that is often used as a building block for designing optimization algorithms. In this work, we use the PPM method to provide conceptually simple derivations along with…

Optimization and Control · Mathematics 2022-06-03 Kwangjun Ahn, Suvrit Sra

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

We present a totally asynchronous algorithm for convex optimization that is based on a novel generalization of Nesterov's accelerated gradient method. This algorithm is developed for fast convergence under "total asynchrony," i.e., allowing…

Optimization and Control · Mathematics 2024-06-17 Ellie Pond, April Sebok, Zachary Bell, Matthew Hale

Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-29 Byungsoo Jeon, Mengdi Wu, Shiyi Cao, Sunghyun Kim, Sunghyun Park, Neeraj Aggarwal, Colin Unger, Daiyaan Arfeen, Peiyuan Liao, Xupeng Miao, Mohammad Alizadeh, Gregory R. Ganger, Tianqi Chen, Zhihao Jia

Recent studies incorporate Nesterov's accelerated gradient method to accelerate gradient-based training. Nesterov's Accelerated Quasi-Newton (NAQ) method has been shown to drastically improve the convergence speed compared to the…

Machine Learning · Computer Science 2020-10-16 S. Indrapriyadarsini, Shahrzad Mahboubi, Hiroshi Ninomiya, Hideki Asai

Classical machine learning models such as deep neural networks are usually trained by using Stochastic Gradient Descent-based (SGD) algorithms. The classical SGD can be interpreted as a discretization of the stochastic gradient flow. In…

Optimization and Control · Mathematics 2023-10-03 Valentin Leplat, Daniil Merkulov, Aleksandr Katrutsa, Daniel Bershatsky, Olga Tsymboi, Ivan Oseledets

Due to its simplicity and efficiency, the first-order gradient method has been extensively employed in training neural networks. Although the optimization problem of the neural network is non-convex, recent research has proved that the…

Machine Learning · Computer Science 2024-05-09 Xin Liu, Wei Tao, Wei Li, Dazhi Zhan, Jun Wang, Zhisong Pan

A novel dynamical inertial Newton system, called the Hessian-driven Nesterov accelerated gradient (H-NAG) flow, is proposed. Convergence of the continuous trajectory is established via a tailored Lyapunov function, and new first-order…

Optimization and Control · Mathematics 2019-12-25 Long Chen, Hao Luo