Related papers: Communication-Efficient Distributed Deep Learning:…

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

With the rapid growth in the volume of data sets, models, and devices in the domain of deep learning, there is increasing attention on large-scale distributed deep learning. In contrast to traditional distributed deep learning, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-10 Feng Liang, Zhen Zhang, Haifeng Lu, Victor C. M. Leung, Yanyi Guo, Xiping Hu

A Quantitative Survey of Communication Optimizations in Distributed Deep Learning

Nowadays, large and complex deep learning (DL) models are increasingly trained in a distributed manner across multiple worker machines, in which extensive communications between workers pose serious scaling problems. In this article, we…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-10 Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Chengjian Liu, Wei Wang, Bo Li

Communication Optimization for Distributed Training: Architecture, Advances, and Opportunities

The past few years have witnessed the flourishing of large-scale deep neural network models with ever-growing parameter numbers. Training such large-scale models typically requires massive memory and computing resources, necessitating…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-30 Yunze Wei, Tianshuo Hu, Cong Liang, Yong Cui

Communication-Efficient Distributed Deep Learning via Federated Dynamic Averaging

The ever-growing volume and decentralized nature of data, coupled with the need to harness it and extract knowledge, have led to the extensive use of distributed deep learning (DDL) techniques for training. These techniques rely on local…

Machine Learning · Computer Science 2024-11-22 Michail Theologitis, Georgios Frangias, Georgios Anestis, Vasilis Samoladas, Antonios Deligiannakis

Communication optimization strategies for distributed deep neural network training: A survey

Recent trends in high-performance computing and deep learning have led to the proliferation of studies on large-scale deep neural network training. However, the frequent communication requirements among computation nodes drastically slow…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-01-27 Shuo Ouyang, Dezun Dong, Yemao Xu, Liquan Xiao

Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools

Deep Learning (DL) has had an immense success in the recent past, leading to state-of-the-art results in various domains such as image recognition and natural language processing. One of the reasons for this success is the increasing size…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-09-26 Ruben Mayer, Hans-Arno Jacobsen

Demystifying the Communication Characteristics for Distributed Transformer Models

Deep learning (DL) models based on the transformer architecture have revolutionized many DL applications such as large language models (LLMs), vision transformers, audio generation, and time series prediction. Much of this progress has been…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-20 Quentin Anthony, Benjamin Michalowicz, Jacob Hatef, Lang Xu, Mustafa Abduljabbar, Aamir Shafi, Hari Subramoni, Dhabaleswar Panda

Systems for Parallel and Distributed Large-Model Deep Learning Training

Deep learning (DL) has transformed applications in a variety of domains, including computer vision, natural language processing, and tabular data analysis. The search for improved DL model accuracy has led practitioners to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-10 Kabir Nagrecha

A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks

The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields.…

Machine Learning · Computer Science 2022-07-04 Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele

Joint Communication Scheduling and Resource Allocation for Distributed Edge Learning: Seamless Integration in Next-Generation Wireless Networks

Distributed edge learning (DL) is considered a cornerstone of intelligence enablers, since it allows for collaborative training without the necessity for local clients to share raw data with other parties, thereby preserving privacy and…

Systems and Control · Electrical Eng. & Systems 2026-01-15 Paul Zheng, Navid Keshtiarast, Pradyumna Kumar Bishoyi, Yao Zhu, Yulin Hu, Marina Petrova, Anke Schmeink

Data optimization for large batch distributed training of deep neural networks

Distributed training in deep learning (DL) is common practice as data and models grow. The current practice for distributed training of deep neural networks faces the challenges of communication bottlenecks when operating at scale, and…

Machine Learning · Computer Science 2020-12-21 Shubhankar Gahlot, Junqi Yin, Mallikarjun Shankar

Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis

Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we…

Machine Learning · Computer Science 2018-09-18 Tal Ben-Nun, Torsten Hoefler

High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates

As the size of datasets used in statistical learning continues to grow, distributed training of models has attracted increasing attention. These methods partition the data and exploit parallelism to reduce memory and runtime, but suffer…

Machine Learning · Computer Science 2024-07-10 Fred Lu, Ryan R. Curtin, Edward Raff, Francis Ferraro, James Holt

Distributed Learning with Sublinear Communication

In distributed statistical learning, $N$ samples are split across $m$ machines and a learner wishes to use minimal communication to learn as well as if the examples were on a single machine. This model has received substantial interest in…

Machine Learning · Computer Science 2019-03-19 Jayadev Acharya, Christopher De Sa, Dylan J. Foster, Karthik Sridharan

A Survey of Distributed Learning in Cloud, Mobile, and Edge Settings

In the era of deep learning (DL), convolutional neural networks (CNNs), and large language models (LLMs), machine learning (ML) models are becoming increasingly complex, demanding significant computational resources for both inference and…

Machine Learning · Computer Science 2024-05-27 Madison Threadgill, Andreas Gerstlauer

Domain-specific Communication Optimization for Distributed DNN Training

Communication overhead poses an important obstacle to distributed DNN training and draws increasing attention in recent years. Despite continuous efforts, prior solutions such as gradient compression/reduction, compute/communication…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-20 Hao Wang, Jingrong Chen, Xinchen Wan, Han Tian, Jiacheng Xia, Gaoxiong Zeng, Weiyan Wang, Kai Chen, Wei Bai, Junchen Jiang

Resource Allocation and Workload Scheduling for Large-Scale Distributed Deep Learning: A Survey

With rapidly increasing distributed deep learning workloads in large-scale data centers, efficient distributed deep learning framework strategies for resource allocation and workload scheduling have become the key to high-performance deep…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-13 Feng Liang, Zhen Zhang, Haifeng Lu, Chengming Li, Victor C. M. Leung, Yanyi Guo, Xiping Hu

Communication-Efficient Algorithms For Distributed Optimization

This thesis is concerned with the design of distributed algorithms for solving optimization problems. We consider networks where each node has exclusive access to a cost function, and design algorithms that make all nodes cooperate to find…

Optimization and Control · Mathematics 2013-12-03 João F. C. Mota

Decentralized Deep Learning for Multi-Access Edge Computing: A Survey on Communication Efficiency and Trustworthiness

Wider coverage and a better solution to a latency reduction in 5G necessitate its combination with multi-access edge computing (MEC) technology. Decentralized deep learning (DDL) such as federated learning and swarm learning as a promising…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-03-23 Yuwei Sun, Hideya Ochiai, Hiroshi Esaki

Communication-Efficient Learning of Deep Networks from Decentralized Data

Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image…

Machine Learning · Computer Science 2023-01-30 H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, Blaise Agüera y Arcas