Related papers: Echo: Simulating Distributed Training At Scale

Echo: Decoupling Inference and Training for Large-Scale RL Alignment on Heterogeneous Swarms

Modern RL-based post-training for large language models (LLMs) co-locates trajectory sampling and policy optimisation on the same GPU cluster, forcing the system to switch between inference and training workloads. This serial context…

Machine Learning · Computer Science 2025-08-13 Jie Xiao, Changyuan Fan, Qingnan Ren, Alfred Long, Yuchen Zhang, Rymon Yu, Eric Yang, Lynn Ai, Shaoduo Gan

Echo: A Large Language Model with Temporal Episodic Memory

Research on large language models (LLMs) has shown remarkable performance in domains such as mathematics, programming, and literary creation. However, most studies have focused on semantic memory-based question answering, neglecting LLMs'…

Computation and Language · Computer Science 2025-02-25 WenTao Liu, Ruohua Zhang, Aimin Zhou, Feng Gao, JiaLi Liu

ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers…

Machine Learning · Computer Science 2026-05-27 Jingwei Song, Meng Chen, Jie Xiao, Qingnan Ren, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Zhisheng Chen, Ziqian Bi, Shuo Lu, Yiqun Duan, Xu Wang, Rymon Yu, Lynn Ai, Eric Yang, Tianyu Shi

HETHUB: A Distributed Training System with Heterogeneous Cluster for Large-Scale Models

Training large-scale models relies on a vast number of computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25000 A100 GPUs. It is a challenge to build a large-scale cluster with one type of…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-12 Si Xu, Zixiao Huang, Yan Zeng, Shengen Yan, Xuefei Ning, Quanlu Zhang, Haolin Ye, Sipei Gu, Chunsheng Shui, Zhezheng Lin, Hao Zhang, Sheng Wang, Guohao Dai, Yu Wang

Scaling Distributed Machine Learning with In-Network Aggregation

Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-01 Amedeo Sapio, Marco Canini, Chen-Yu Ho, Jacob Nelson, Panos Kalnis, Changhoon Kim, Arvind Krishnamurthy, Masoud Moshref, Dan R. K. Ports, Peter Richtárik

EasyScale: Accuracy-consistent Elastic Training for Deep Learning

Distributed synchronized GPU training is commonly used for deep learning. The resource constraint of using a fixed number of GPUs makes large-scale training jobs suffer from long queuing time for resource allocation, and lowers the cluster…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-08 Mingzhen Li, Wencong Xiao, Biao Sun, Hanyu Zhao, Hailong Yang, Shiru Ren, Zhongzhi Luan, Xianyan Jia, Yi Liu, Yong Li, Wei Lin, Depei Qian

Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training

Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern…

Machine Learning · Computer Science 2025-04-15 Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn

Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington, Leo Phan, Chris Pierre Paul, Evan Shoemaker, Priyanka Ranade, Torstein Collett, Grant Hodgson Perez, Christopher Krieger

Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training

Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Mingyu Liang, Hiwot Tadese Kassa, Wenyin Fu, Brian Coutinho, Louis Feng, Christina Delimitrou

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

Computation and Language · Computer Science 2021-08-25 Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia

CO2: Efficient Distributed Training with Full Communication-Computation Overlap

The fundamental success of large language models hinges upon the efficacious implementation of large-scale distributed training techniques. Nevertheless, building a vast, high-performance cluster featuring high-speed communication…

Computation and Language · Computer Science 2024-01-30 Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

Scalability Evaluation of HPC Multi-GPU Training for ECG-based LLMs

Training large language models requires extensive processing, made possible by many high-performance computing resources. This study compares multi-node and multi-GPU environments for training large language models of electrocardiograms. It…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-28 Dimitar Mileski, Nikola Petrovski, Marjan Gusev

Scalable Machine Learning Training Infrastructure for Online Ads Recommendation and Auction Scoring Modeling at Google

Large-scale Ads recommendation and auction scoring models at Google scale demand immense computational resources. While specialized hardware like TPUs have improved linear algebra computations, bottlenecks persist in large-scale systems.…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-22 George Kurian, Somayeh Sardashti, Ryan Sims, Felix Berger, Gary Holt, Yang Li, Jeremiah Willcock, Kaiyuan Wang, Herve Quiroz, Abdulrahman Salem, Julian Grady

Scalable Training of Mixture-of-Experts Models with Megatron Core

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-11 Zijie Yan, Hongxiao Bai, Xin Yao, Dennis Liu, Tong Liu, Hongbin Liu, Pingtian Li, Evan Wu, Shiqing Fan, Li Tao, Robin Zhang, Yuzhong Wang, Shifang Xu, Jack Chang, Xuwen Chen, Kunlun Li, Yan Bai, Gao Deng, Nan Zheng, Vijay Anand Korthikanti, Abhinav Khattar, Ethan He, Soham Govande, Sangkug Lym, Zhongbo Zhu, Qi Zhang, Haochen Yuan, Xiaowei Ren, Deyu Fu, Tailai Ma, Shunkang Zhang, Jiang Shao, Ray Wang, Vasudevan Rengasamy, Rachit Garg, Santosh Bhavani, Xipeng Li, Chandler Zhou, David Wu, Yingcan Wei, Ashwath Aithal, Michael Andersch, Mohammad Shoeybi, Jiajie Yao, June Yang

BLoad: Enhancing Neural Network Training with Efficient Sequential Data Handling

The increasing complexity of modern deep neural network models and the expanding sizes of datasets necessitate the development of optimized and scalable training methods. In this white paper, we addressed the challenge of efficiently…

Machine Learning · Computer Science 2024-04-29 Raphael Ruschel, A. S. M. Iftekhar, B. S. Manjunath, Suya You

Distributed Training Large-Scale Deep Architectures

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-21 Shang-Xuan Zou, Chun-Yen Chen, Jui-Lin Wu, Chun-Nan Chou, Chia-Chin Tsao, Kuan-Chieh Tung, Ting-Wei Lin, Cheng-Lung Sung, Edward Y. Chang

Addressing Algorithmic Bottlenecks in Elastic Machine Learning with Chicle

Distributed machine learning training is one of the most common and important workloads running on data centers today, but it is rarely executed alone. Instead, to reduce costs, computing resources are consolidated and shared by different…

Machine Learning · Computer Science 2019-09-12 Michael Kaufmann, Kornilios Kourtis, Celestine Mendler-Dünner, Adrian Schüpbach, Thomas Parnell

ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training

This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA architectures while maintaining its learning capacity. ECHO-LLaMA transforms LLaMA models into…

Machine Learning · Computer Science 2025-06-24 Maryam Dialameh, Rezaul Karim, Hossein Rajabzadeh, Omar Mohamed Awad, Hyock Ju Kwon, Boxing Chen, Walid Ahmed, Yang Liu

Efficient Parallelization Layouts for Large-Scale Distributed Model Training

Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding…

Machine Learning · Computer Science 2024-09-25 Johannes Hagemann, Samuel Weinbach, Konstantin Dobler, Maximilian Schall, Gerard de Melo

Simulating LLM training workloads for heterogeneous compute and network infrastructure

The growing demand for large-scale GPU clusters in distributed model training presents a significant barrier to innovation, particularly in model optimization, performance tuning, and system-level enhancements. To address this challenge,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-08 Sumit Kumar, Arjun Temura, Naman Sharma, Ramanjeet Singh, Meet Dadhania, Praveen Tammana, Satananda Burla, Abed Mohammad Kamaluddin, Rinku Shah