English
Related papers

Related papers: Echo: Simulating Distributed Training At Scale

200 papers

Modern RL-based post-training for large language models (LLMs) co-locate trajectory sampling and policy optimisation on the same GPU cluster, forcing the system to switch between inference and training workloads. This serial context…

Machine Learning · Computer Science 2025-08-13 Jie Xiao , Changyuan Fan , Qingnan Ren , Alfred Long , Yuchen Zhang , Rymon Yu , Eric Yang , Lynn Ai , Shaoduo Gan

Research on large language models (LLMs) has shown remarkable performance in domains such as mathematics, programming, and literary creation. However, most studies have focused on semantic memory-based question answering, neglecting LLMs'…

Computation and Language · Computer Science 2025-02-25 WenTao Liu , Ruohua Zhang , Aimin Zhou , Feng Gao , JiaLi Liu

Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers…

Training large-scale models relies on a vast number of computing resources. For example, training the GPT-4 model (1.8 trillion parameters) requires 25000 A100 GPUs . It is a challenge to build a large-scale cluster with one type of…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-12 Si Xu , Zixiao Huang , Yan Zeng , Shengen Yan , Xuefei Ning , Quanlu Zhang , Haolin Ye , Sipei Gu , Chunsheng Shui , Zhezheng Lin , Hao Zhang , Sheng Wang , Guohao Dai , Yu Wang

Training machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-10-01 Amedeo Sapio , Marco Canini , Chen-Yu Ho , Jacob Nelson , Panos Kalnis , Changhoon Kim , Arvind Krishnamurthy , Masoud Moshref , Dan R. K. Ports , Peter Richtárik

Distributed synchronized GPU training is commonly used for deep learning. The resource constraint of using a fixed number of GPUs makes large-scale training jobs suffer from long queuing time for resource allocation, and lowers the cluster…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-08 Mingzhen Li , Wencong Xiao , Biao Sun , Hanyu Zhao , Hailong Yang , Shiru Ren , Zhongzhi Luan , Xianyan Jia , Yi Liu , Yong Li , Wei Lin , Depei Qian

Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern…

Machine Learning · Computer Science 2025-04-15 Jared Fernandez , Luca Wehrstedt , Leonid Shamis , Mostafa Elhoushi , Kalyan Saladi , Yonatan Bisk , Emma Strubell , Jacob Kahn

AI accelerator processing capabilities and memory constraints largely dictate the scale in which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state of the art,…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-12 Michael Benington , Leo Phan , Chris Pierre Paul , Evan Shoemaker , Priyanka Ranade , Torstein Collett , Grant Hodgson Perez , Christopher Krieger

Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-15 Mingyu Liang , Hiwot Tadese Kassa , Wenyin Fu , Brian Coutinho , Louis Feng , Christina Delimitrou

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…

The fundamental success of large language models hinges upon the efficacious implementation of large-scale distributed training techniques. Nevertheless, building a vast, high-performance cluster featuring high-speed communication…

Computation and Language · Computer Science 2024-01-30 Weigao Sun , Zhen Qin , Weixuan Sun , Shidi Li , Dong Li , Xuyang Shen , Yu Qiao , Yiran Zhong

Training large language models requires extensive processing, made possible by many high-performance computing resources. This study compares multi-node and multi-GPU environments for training large language models of electrocardiograms. It…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-28 Dimitar Mileski , Nikola Petrovski , Marjan Gusev

Large-scale Ads recommendation and auction scoring models at Google scale demand immense computational resources. While specialized hardware like TPUs have improved linear algebra computations, bottlenecks persist in large-scale systems.…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-01-22 George Kurian , Somayeh Sardashti , Ryan Sims , Felix Berger , Gary Holt , Yang Li , Jeremiah Willcock , Kaiyuan Wang , Herve Quiroz , Abdulrahman Salem , Julian Grady

Scaling Mixture-of-Experts (MoE) training introduces systems challenges absent in dense models. Because each token activates only a subset of experts, this sparsity allows total parameters to grow much faster than per-token computation,…

The increasing complexity of modern deep neural network models and the expanding sizes of datasets necessitate the development of optimized and scalable training methods. In this white paper, we addressed the challenge of efficiently…

Machine Learning · Computer Science 2024-04-29 Raphael Ruschel , A. S. M. Iftekhar , B. S. Manjunath , Suya You

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-09-21 Shang-Xuan Zou , Chun-Yen Chen , Jui-Lin Wu , Chun-Nan Chou , Chia-Chin Tsao , Kuan-Chieh Tung , Ting-Wei Lin , Cheng-Lung Sung , Edward Y. Chang

Distributed machine learning training is one of the most common and important workloads running on data centers today, but it is rarely executed alone. Instead, to reduce costs, computing resources are consolidated and shared by different…

Machine Learning · Computer Science 2019-09-12 Michael Kaufmann , Kornilios Kourtis , Celestine Mendler-Dünner , Adrian Schüpbach , Thomas Parnell

This paper introduces ECHO-LLaMA, an efficient LLaMA architecture designed to improve both the training speed and inference throughput of LLaMA architectures while maintaining its learning capacity. ECHO-LLaMA transforms LLaMA models into…

Machine Learning · Computer Science 2025-06-24 Maryam Dialameh , Rezaul Karim , Hossein Rajabzadeh , Omar Mohamed Awad , Hyock Ju Kwon , Boxing Chen , Walid Ahmed , Yang Liu

Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies have complex interactions regarding…

Machine Learning · Computer Science 2024-09-25 Johannes Hagemann , Samuel Weinbach , Konstantin Dobler , Maximilian Schall , Gerard de Melo

The growing demand for large-scale GPU clusters in distributed model training presents a significant barrier to innovation, particularly in model optimization, performance tuning, and system-level enhancements. To address this challenge,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-08-08 Sumit Kumar , Arjun Temura , Naman Sharma , Ramanjeet Singh , Meet Dadhania , Praveen Tammana , Satananda Burla , Abed Mohammad Kamaluddin , Rinku Shah
‹ Prev 1 2 3 10 Next ›