Related papers: Verify Distributed Deep Learning Model Implementat…

Latent Iterative Refinement for Modular Source Separation

Traditional source separation approaches train deep neural network models end-to-end with all the data available at once by minimizing the empirical risk on the whole training set. On the inference side, after training the model, the user…

Sound · Computer Science 2023-10-17 Dimitrios Bralios , Efthymios Tzinis , Gordon Wichern , Paris Smaragdis , Jonathan Le Roux

Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models

With the rapid development of large language models (LLMs), distributed training and inference frameworks like DeepSpeed have become essential for scaling model training and inference across multiple GPUs or nodes. However, the increasing…

Software Engineering · Computer Science 2025-06-13 Xiao Yu , Haoxuan Chen , Feifei Niu , Xing Hu , Jacky Wai Keung , Xin Xia

Efficient Fine-Grained GPU Performance Modeling for Distributed Deep Learning of LLM

Training Large Language Models(LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion parameter models distributed across hundreds of GPUs remains challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-30 Biyao Zhang , Mingkai Zheng , Debargha Ganguly , Xuecen Zhang , Vikash Singh , Vipin Chaudhary , Zhao Zhang

Rise of Distributed Deep Learning Training in the Big Model Era: From a Software Engineering Perspective

Deep learning (DL) has become a key component of modern software. In the "big model" era, the rich features of DL-based software substantially rely on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained…

Software Engineering · Computer Science 2023-05-01 Xuanzhe Liu , Diandian Gu , Zhenpeng Chen , Jinfeng Wen , Zili Zhang , Yun Ma , Haoyu Wang , Xin Jin

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of…

Hardware Architecture · Computer Science 2024-07-23 Joyjit Kundu , Wenzhe Guo , Ali BanaGozar , Udari De Alwis , Sourav Sengupta , Puneet Gupta , Arindam Mallik

Recycling Model Updates in Federated Learning: Are Gradient Subspaces Low-Rank?

In this paper, we question the rationale behind propagating large numbers of parameters through a distributed system during federated learning. We start by examining the rank characteristics of the subspace spanned by gradients across…

Machine Learning · Computer Science 2022-02-02 Sheikh Shams Azam , Seyyedali Hosseinalipour , Qiang Qiu , Christopher Brinton

Characterizing and Understanding Distributed GNN Training on GPUs

Graph neural network (GNN) has been demonstrated to be a powerful model in many domains for its effectiveness in learning over graphs. To scale GNN training for large graphs, a widely adopted approach is distributed training which…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-04-19 Haiyang Lin , Mingyu Yan , Xiaocheng Yang , Mo Zou , Wenming Li , Xiaochun Ye , Dongrui Fan

A Comprehensive Study of Bugs in Modern Distributed Deep Learning Systems

In today's data-driven era, deep learning is vital for processing massive datasets, yet single-device training is constrained by computational and memory limits. Distributed deep learning overcomes these challenges by leveraging multiple…

Software Engineering · Computer Science 2025-12-24 Xiaoxue Ma , Wanwei Zhan , Jiale Chen , Yishu Li , Jacky Keung , Federica Sarro

Linear Regression with Distributed Learning: A Generalization Error Perspective

Distributed learning provides an attractive framework for scaling the learning task by sharing the computational load over multiple nodes in a network. Here, we investigate the performance of distributed learning for large-scale linear…

Machine Learning · Statistics 2021-11-03 Martin Hellkvist , Ayça Özçelikkale , Anders Ahlén

Efficient Data-Parallel Continual Learning with Asynchronous Distributed Rehearsal Buffers

Deep learning has emerged as a powerful method for extracting valuable information from large volumes of data. However, when new training data arrives continuously (i.e., is not fully available from the beginning), incremental training…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-06 Thomas Bouvier , Bogdan Nicolae , Hugo Chaugier , Alexandru Costan , Ian Foster , Gabriel Antoniu

Embedded Distributed Inference of Deep Neural Networks: A Systematic Review

Embedded distributed inference of Neural Networks has emerged as a promising approach for deploying machine-learning models on resource-constrained devices in an efficient and scalable manner. The inference task is distributed across a…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-07 Federico Nicolás Peccia , Oliver Bringmann

An Effective Data-Driven Approach for Localizing Deep Learning Faults

Deep Learning (DL) applications are being used to solve problems in critical domains (e.g., autonomous driving or medical diagnosis systems). Thus, developers need to debug their systems to ensure that the expected behavior is delivered.…

Software Engineering · Computer Science 2023-07-19 Mohammad Wardat , Breno Dantas Cruz , Wei Le , Hridesh Rajan

DeepGuard: Secure Code Generation via Multi-Layer Semantic Aggregation

Large Language Models (LLMs) for code generation can replicate insecure patterns from their training data. To mitigate this, a common strategy for security hardening is to fine-tune models using supervision derived from the final…

Software Engineering · Computer Science 2026-04-13 Li Huang , Zhongxin Liu , Yifan Wu , Tao Yin , Dong Li , Jichao Bi , Nankun Mu , Hongyu Zhang , Meng Yan

A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks

The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields.…

Machine Learning · Computer Science 2022-07-04 Daniel Nichols , Siddharth Singh , Shu-Huai Lin , Abhinav Bhatele

Model Accuracy and Runtime Tradeoff in Distributed Deep Learning:A Systematic Study

This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of…

Machine Learning · Statistics 2016-12-07 Suyog Gupta , Wei Zhang , Fei Wang

Modeling Scalability of Distributed Machine Learning

Present day machine learning is computationally intensive and processes large amounts of data. It is implemented in a distributed fashion in order to address these scalability issues. The work is parallelized across a number of computing…

Machine Learning · Computer Science 2017-03-28 Alexander Ulanov , Andrey Simanovsky , Manish Marwah

A Controlled Experiment of Different Code Representations for Learning-Based Bug Repair

Training a deep learning model on source code has gained significant traction recently. Since such models reason about vectors of numbers, source code needs to be converted to a code representation before vectorization. Numerous approaches…

Software Engineering · Computer Science 2022-07-18 Marjane Namavar , Noor Nashid , Ali Mesbah

Distributed Networked Real-time Learning

Many machine learning algorithms have been developed under the assumption that data sets are already available in batch form. Yet in many application domains data is only available sequentially overtime via compute nodes in different…

Optimization and Control · Mathematics 2020-09-10 Alfredo Garcia , Luochao Wang , Jeff Huang , Lingzhou Hong

Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent

Despite their wide adoption in various domains (e.g., healthcare, finance, software engineering), Deep Learning (DL)-based applications suffer from many bugs, failures, and vulnerabilities. Reproducing these bugs is essential for their…

Software Engineering · Computer Science 2026-02-27 Mehil B Shah , Mohammad Masudur Rahman , Foutse Khomh

TDML -- A Trustworthy Distributed Machine Learning Framework

Recent years have witnessed a surge in deep learning research, marked by the introduction of expansive generative models like OpenAI's SORA and GPT, Meta AI's LLAMA series, and Google's FLAN, BART, and Gemini models. However, the rapid…

Cryptography and Security · Computer Science 2024-07-11 Zhen Wang , Qin Wang , Guangsheng Yu , Shiping Chen