Related papers: Coordinating Distributed Example Orders for Provab…

GraB: Finding Provably Better Data Permutations than Random Reshuffling

Random reshuffling, which randomly permutes the dataset each epoch, is widely adopted in model training because it yields faster convergence than with-replacement sampling. Recent studies indicate greedily chosen data orderings can further…

Machine Learning · Computer Science 2023-01-05 Yucheng Lu, Wentao Guo, Christopher De Sa

GraB-sampler: Optimal Permutation-based SGD Data Sampler for PyTorch

The online Gradient Balancing (GraB) algorithm, which greedily chooses the example ordering by solving the herding problem using per-sample gradients, is proved to be the theoretically optimal solution that is guaranteed to outperform Random…

Machine Learning · Computer Science 2023-10-02 Guanghao Wei

Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods

While SGD, which samples from the data with replacement, is widely studied in theory, a variant called Random Reshuffling (RR) is more common in practice. RR iterates through random permutations of the dataset and has been shown to converge…

Machine Learning · Computer Science 2022-02-07 Amirkeivan Mohtashami, Sebastian Stich, Martin Jaggi

On the Utility of Gradient Compression in Distributed Training Systems

A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, a long line of recent work proposes gradient and model compression methods. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-01 Saurabh Agarwal, Hongyi Wang, Shivaram Venkataraman, Dimitris Papailiopoulos

Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning

Distributed implementations are crucial in speeding up large-scale machine learning applications. Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A…

Information Theory · Computer Science 2021-03-02 Baturalp Buyukates, Emre Ozfatura, Sennur Ulukus, Deniz Gunduz

GRAWA: Gradient-based Weighted Averaging for Distributed Training of Deep Learning Models

We study distributed training of deep learning models in time-constrained environments. We propose a new algorithm that periodically pulls workers towards the center variable computed as a weighted average of workers, where the weights are…

Machine Learning · Computer Science 2024-03-08 Tolga Dimlioglu, Anna Choromanska

Asynchronous Distributed Semi-Stochastic Gradient Optimization

With the recent proliferation of large-scale learning problems, there has been a lot of interest in distributed machine learning algorithms, particularly those that are based on stochastic gradient descent (SGD) and its variants. However,…

Machine Learning · Computer Science 2015-12-07 Ruiliang Zhang, Shuai Zheng, James T. Kwok

Variance Reduction in SGD by Distributed Importance Sampling

Humans are able to accelerate their learning by selecting training materials that are the most informative and at the appropriate level of difficulty. We propose a framework for distributing deep learning in which one set of workers search…

Machine Learning · Statistics 2016-04-19 Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, Yoshua Bengio

Federated Optimization Algorithms with Random Reshuffling and Gradient Compression

Gradient compression is a popular technique for improving communication complexity of stochastic first-order methods in distributed training of machine learning models. However, the existing works consider only with-replacement sampling of…

Machine Learning · Computer Science 2022-11-04 Abdurakhmon Sadiev, Grigory Malinovsky, Eduard Gorbunov, Igor Sokolov, Ahmed Khaled, Konstantin Burlachenko, Peter Richtárik

Stochastic Re-weighted Gradient Descent via Distributionally Robust Optimization

We present Re-weighted Gradient Descent (RGD), a novel optimization technique that improves the performance of deep neural networks through dynamic sample re-weighting. Leveraging insights from distributionally robust optimization (DRO)…

Machine Learning · Computer Science 2024-10-15 Ramnath Kumar, Kushal Majmundar, Dheeraj Nagaraj, Arun Sai Suggala

Gradient Coding with Clustering and Multi-message Communication

Gradient descent (GD) methods are commonly employed in machine learning problems to optimize the parameters of the model in an iterative fashion. For problems with massive datasets, computations are distributed to many parallel computing…

Information Theory · Computer Science 2019-03-06 Emre Ozfatura, Deniz Gunduz, Sennur Ulukus

Provable Advantage of Curriculum Learning on Parity Targets with Mixed Inputs

Experimental results have shown that curriculum learning, i.e., presenting simpler examples before more complex ones, can improve the efficiency of learning. Some recent theoretical results also showed that changing the sampling…

Machine Learning · Computer Science 2023-06-30 Emmanuel Abbe, Elisabetta Cornacchia, Aryo Lotfi

Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD

Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can…

Machine Learning · Statistics 2018-05-11 Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, Priya Nagpurkar

Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization

Large-scale distributed optimization is of great importance in various applications. For data-parallel based distributed learning, the inter-node gradient communication often becomes the performance bottleneck. In this paper, we propose the…

Computer Vision and Pattern Recognition · Computer Science 2018-06-22 Jiaxiang Wu, Weidong Huang, Junzhou Huang, Tong Zhang

Slow and Stale Gradients Can Win the Race

Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in runtime as it waits for the slowest workers (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness…

Machine Learning · Statistics 2020-03-25 Sanghamitra Dutta, Jianyu Wang, Gauri Joshi

Constrained Deep Learning using Conditional Gradient and Applications in Computer Vision

A number of results have recently demonstrated the benefits of incorporating various constraints when training deep architectures in vision and machine learning. The advantages range from guarantees for statistical generalization to better…

Machine Learning · Computer Science 2019-05-27 Sathya N. Ravi, Tuan Dinh, Vishnu Lokhande, Vikas Singh

GRAB: An LLM-Inspired Sequence-First Click-Through Rate Prediction Modeling Paradigm

Traditional Deep Learning Recommendation Models (DLRMs) face increasing bottlenecks in performance and efficiency, often struggling with generalization and long-sequence modeling. Inspired by the scaling success of Large Language Models…

Information Retrieval · Computer Science 2026-02-04 Shaopeng Chen, Chuyue Xie, Huimin Ren, Shaozong Zhang, Han Zhang, Ruobing Cheng, Zhiqiang Cao, Zehao Ju, Yu Gao, Jie Ding, Xiaodong Chen, Xuewu Jiao, Shuanglong Li, Liu Lin

Aiding Global Convergence in Federated Learning via Local Perturbation and Mutual Similarity Information

Federated learning has emerged in the last decade as a distributed optimization paradigm due to the rapidly increasing number of portable devices able to support the heavy computational needs related to the training of machine learning…

Machine Learning · Computer Science 2024-10-10 Emanuel Buttaci, Giuseppe Carlo Calafiore

Toward Communication Efficient Adaptive Gradient Method

In recent years, distributed optimization has proven to be an effective approach to accelerate training of large-scale machine learning models such as deep neural networks. With the increasing computation power of GPUs, the bottleneck of…

Machine Learning · Computer Science 2021-09-14 Xiangyi Chen, Xiaoyun Li, Ping Li

Implicit Gradient Alignment in Distributed and Federated Learning

A major obstacle to achieving global convergence in distributed and federated learning is the misalignment of gradients across clients or mini-batches due to the heterogeneity and stochasticity of the distributed data. In this work, we show…

Machine Learning · Computer Science 2021-12-14 Yatin Dandi, Luis Barba, Martin Jaggi