Related papers: Saturn: Efficient Multi-Large-Model Deep Learning

Saturn: An Optimized Data System for Large Model Deep Learning Workloads

Large language models such as GPT-3 & ChatGPT have transformed deep learning (DL), powering applications that have captured the public's imagination. These models are rapidly being adopted across domains for analytics on various modalities,…

Machine Learning · Computer Science 2023-12-14 Kabir Nagrecha, Arun Kumar

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

Transformer models have achieved state-of-the-art performance across various application domains and have gradually become the foundation of advanced large deep learning (DL) models. However, how to train these models over multiple GPUs…

Machine Learning · Computer Science 2022-11-28 Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang, Bin Cui

Optimizing Data Curation through Spectral Analysis and Joint Batch Selection (SALN)

In modern deep learning models, long training times and large datasets present significant challenges to both efficiency and scalability. Effective data curation and sample selection are crucial for optimizing the training process of deep…

Machine Learning · Computer Science 2024-12-24 Mohammadreza Sharifi

SATURN: SAT-based Reinforcement Learning to Unleash LLMs Reasoning

How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer…

Machine Learning · Computer Science 2026-03-11 Huanyu Liu, Ge Li, Jia Li, Hao Zhu, Kechi Zhang, Yihong Dong

Distributed Training Large-Scale Deep Architectures

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-21 Shang-Xuan Zou, Chun-Yen Chen, Jui-Lin Wu, Chun-Nan Chou, Chia-Chin Tsao, Kuan-Chieh Tung, Ting-Wei Lin, Cheng-Lung Sung, Edward Y. Chang

Model-Parallel Model Selection for Deep Learning Systems

As deep learning becomes more expensive, both in terms of time and compute, inefficiencies in machine learning (ML) training prevent practical usage of state-of-the-art models for most users. The newest model architectures are simply too…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-15 Kabir Nagrecha

Saturn Platform: Foundation Model Operations and Generative AI for Financial Services

Saturn is an innovative platform that assists Foundation Model (FM) building and its integration with IT operations (Ops). It is custom-made to meet the requirements of data scientists, enabling them to effectively create and implement FMs…

Artificial Intelligence · Computer Science 2023-12-14 Antonio J. G. Busson, Rennan Gaio, Rafael H. Rocha, Francisco Evangelista, Bruno Rizzi, Luan Carvalho, Rafael Miceli, Marcos Rabaioli, David Favaro

Estudio de la eficiencia en la escalabilidad de GPUs para el entrenamiento de Inteligencia Artificial (Study of the Efficiency of GPU Scalability for Artificial Intelligence Training)

Training large-scale deep learning models has become a key challenge for the scientific community and industry. While the massive use of GPUs can significantly speed up training times, this approach has a negative impact on efficiency. In…

Machine Learning · Computer Science 2025-09-04 David Cortes, Carlos Juiz, Belen Bermejo

Large-Scale Deep Learning Optimizations: A Comprehensive Survey

Deep learning has achieved promising results on a wide spectrum of AI applications. Larger datasets and models consistently yield better performance. However, we generally spend longer training time on more computation and communication.…

Machine Learning · Computer Science 2021-11-03 Xiaoxin He, Fuzhao Xue, Xiaozhe Ren, Yang You

Systems for Parallel and Distributed Large-Model Deep Learning Training

Deep learning (DL) has transformed applications in a variety of domains, including computer vision, natural language processing, and tabular data analysis. The search for improved DL model accuracy has led practitioners to explore…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-01-10 Kabir Nagrecha

Arena: Efficiently Training Large Models via Dynamic Scheduling and Adaptive Parallelism Co-Design

Efficiently training large-scale models (LMs) in GPU clusters involves two separate avenues: inter-job dynamic scheduling and intra-job adaptive parallelism (AP). However, existing dynamic schedulers struggle with large-model scheduling due…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-25 Chunyu Xue, Weihao Cui, Quan Chen, Chen Chen, Han Zhao, Shulai Zhang, Linmei Wang, Yan Li, Limin Xiao, Weifeng Zhang, Jing Yang, Bingsheng He, Minyi Guo

COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-15 Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexandros Daglis

Galvatron: An Automatic Distributed System for Efficient Foundation Model Training

Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-01 Xinyi Liu, Yujie Wang, Shenhan Zhu, Fangcheng Fu, Qingshuo Liu, Guangming Lin, Bin Cui

Decentralized Training of Foundation Models in Heterogeneous Environments

Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often involving tens of thousands of GPUs running continuously for months. These models are typically trained in specialized clusters featuring fast,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-22 Binhang Yuan, Yongjun He, Jared Quincy Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy Liang, Christopher Re, Ce Zhang

Distributed Training and Optimization Of Neural Networks

Deep learning models are yielding increasingly better performance thanks to multiple factors. To be successful, a model may have a large number of parameters or a complex architecture and be trained on a large dataset. This leads to large…

Machine Learning · Computer Science 2022-12-20 Jean-Roch Vlimant, Junqi Yin

Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems

With the rapid growth of Internet services, recommendation systems play a central role in delivering personalized content. Faced with massive user requests and complex model architectures, the key challenge for real-time recommendation…

Information Retrieval · Computer Science 2025-08-14 Junli Shao, Jing Dong, Dingzhou Wang, Kowei Shih, Dannier Li, Chengrui Zhou

Improving Automatic Parallel Training via Balanced Memory Workload Optimization

Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains, serving as the foundation for advanced large-scale deep learning (DL) models. However, efficiently…

Machine Learning · Computer Science 2024-09-06 Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, Bin Cui

DAPPLE: A Pipelined Data Parallel Approach for Training Large Models

It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-03 Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, Lansong Diao, Xiaoyong Liu, Wei Lin

The Case for Co-Designing Model Architectures with Hardware

While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-02-01 Quentin Anthony, Jacob Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, Aamir Shafi, Hari Subramoni, Dhabaleswar Panda

LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-27 Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu