Related papers: Verify Distributed Deep Learning Model Implementat…
Traditional source separation approaches train deep neural network models end-to-end with all the data available at once by minimizing the empirical risk on the whole training set. On the inference side, after training the model, the user…
With the rapid development of large language models (LLMs), distributed training and inference frameworks like DeepSpeed have become essential for scaling model training and inference across multiple GPUs or nodes. However, the increasing…
Training Large Language Models(LLMs) is one of the most compute-intensive tasks in high-performance computing. Predicting end-to-end training time for multi-billion parameter models distributed across hundreds of GPUs remains challenging…
Deep learning (DL) has become a key component of modern software. In the "big model" era, the rich features of DL-based software substantially rely on powerful DL models, e.g., BERT, GPT-3, and the recently emerging GPT-4, which are trained…
Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of…
In this paper, we question the rationale behind propagating large numbers of parameters through a distributed system during federated learning. We start by examining the rank characteristics of the subspace spanned by gradients across…
Graph neural network (GNN) has been demonstrated to be a powerful model in many domains for its effectiveness in learning over graphs. To scale GNN training for large graphs, a widely adopted approach is distributed training which…
In today's data-driven era, deep learning is vital for processing massive datasets, yet single-device training is constrained by computational and memory limits. Distributed deep learning overcomes these challenges by leveraging multiple…
Distributed learning provides an attractive framework for scaling the learning task by sharing the computational load over multiple nodes in a network. Here, we investigate the performance of distributed learning for large-scale linear…
Deep learning has emerged as a powerful method for extracting valuable information from large volumes of data. However, when new training data arrives continuously (i.e., is not fully available from the beginning), incremental training…
Embedded distributed inference of Neural Networks has emerged as a promising approach for deploying machine-learning models on resource-constrained devices in an efficient and scalable manner. The inference task is distributed across a…
Deep Learning (DL) applications are being used to solve problems in critical domains (e.g., autonomous driving or medical diagnosis systems). Thus, developers need to debug their systems to ensure that the expected behavior is delivered.…
Large Language Models (LLMs) for code generation can replicate insecure patterns from their training data. To mitigate this, a common strategy for security hardening is to fine-tune models using supervision derived from the final…
The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields.…
This paper presents Rudra, a parameter server based distributed computing framework tuned for training large-scale deep neural networks. Using variants of the asynchronous stochastic gradient descent algorithm we study the impact of…
Present day machine learning is computationally intensive and processes large amounts of data. It is implemented in a distributed fashion in order to address these scalability issues. The work is parallelized across a number of computing…
Training a deep learning model on source code has gained significant traction recently. Since such models reason about vectors of numbers, source code needs to be converted to a code representation before vectorization. Numerous approaches…
Many machine learning algorithms have been developed under the assumption that data sets are already available in batch form. Yet in many application domains data is only available sequentially overtime via compute nodes in different…
Despite their wide adoption in various domains (e.g., healthcare, finance, software engineering), Deep Learning (DL)-based applications suffer from many bugs, failures, and vulnerabilities. Reproducing these bugs is essential for their…
Recent years have witnessed a surge in deep learning research, marked by the introduction of expansive generative models like OpenAI's SORA and GPT, Meta AI's LLAMA series, and Google's FLAN, BART, and Gemini models. However, the rapid…