Echo: Simulating Distributed Training At Scale

Yicheng Feng; Yuetao Chen; Kaiwen Chen; Jingzong Li; Tianyuan Wu; Peng Cheng; Chuan Wu; Wei Wang; Tsung-Yi Ho; Hong Xu

Machine Learning | 2024-12-18 | v1 | Distributed, Parallel, and Cluster Computing

Abstract

Simulation offers unique value for both enumeration and extrapolation, and is becoming increasingly important for managing massive machine learning (ML) clusters and large-scale distributed training jobs. In this paper, we build Echo to tackle three key challenges in large-scale training simulation: (1) tracing the runtime training workload of each device in an ex-situ fashion, so that a single device suffices to obtain the actual execution graphs of 1K-GPU training; (2) accurately estimating collective communication without the high overhead of discrete-event network simulation; and (3) accounting for the interference-induced computation slowdown that arises when communication and computation kernels overlap on the same device. Echo achieves an average error of 8% in training step time, roughly 3x lower than state-of-the-art simulators, for GPT-175B on a 96-GPU H800 cluster with 3D parallelism on Megatron-LM, and completes the simulation in under 2 minutes.
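To make challenge (2) concrete, the sketch below shows one common way to estimate collective communication time analytically rather than through discrete-event network simulation: the textbook alpha-beta (latency-bandwidth) cost model for a ring all-reduce. This is not Echo's estimator, which the abstract does not describe; the function name `ring_allreduce_time`, its parameters, and the example numbers are all illustrative assumptions.

```python
def ring_allreduce_time(message_bytes, num_gpus, latency_s, bandwidth_bytes_per_s):
    """Estimate ring all-reduce time with the classic alpha-beta cost model.

    Assumption: a ring all-reduce runs 2 * (p - 1) steps (reduce-scatter
    followed by all-gather), and each step sends one chunk of size
    message_bytes / p over a link with fixed latency and bandwidth.
    This is a generic analytical model, not the method used by Echo.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth_bytes_per_s must be positive")
    if num_gpus == 1:
        return 0.0  # nothing to communicate on a single device

    steps = 2 * (num_gpus - 1)
    chunk_bytes = message_bytes / num_gpus
    return steps * (latency_s + chunk_bytes / bandwidth_bytes_per_s)


if __name__ == "__main__":
    # Hypothetical numbers: 1 GB of gradients, 8 GPUs, 10 us latency, 100 GB/s links.
    estimate = ring_allreduce_time(1_000_000_000, 8, 1e-5, 100e9)
    print(f"estimated all-reduce time: {estimate * 1e3:.2f} ms")
```

A closed-form model like this evaluates in constant time per collective, which is why analytical estimates are attractive at 1K-GPU scale; the trade-off is that it ignores contention and topology effects that a packet-level simulator would capture.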

Keywords

simulation; large language model training; large language model inference

Cite

@article{arxiv.2412.12487,
  title  = {Echo: Simulating Distributed Training At Scale},
  author = {Yicheng Feng and Yuetao Chen and Kaiwen Chen and Jingzong Li and Tianyuan Wu and Peng Cheng and Chuan Wu and Wei Wang and Tsung-Yi Ho and Hong Xu},
  journal= {arXiv preprint arXiv:2412.12487},
  year   = {2024}
}
