Related papers: Computron: Serving Distributed Deep Learning Model…
Transformer models have achieved state-of-the-art performance on various domains of applications and gradually becomes the foundations of the advanced large deep learning (DL) models. However, how to train these models over multiple GPUs…
The recent success of deep learning applications has coincided with those widely available powerful computational resources for training sophisticated machine learning models with huge datasets. Nonetheless, training large models such as…
Training multi-billion to trillion-parameter language models efficiently on GPU clusters requires leveraging multiple parallelism strategies. We present Galvatron, a novel open-source framework (dubbed 'Optimus-Megatron' in the…
Training transformer models requires substantial GPU compute and memory resources. In homogeneous clusters, distributed strategies allocate resources evenly, but this approach is inefficient for heterogeneous clusters, where GPUs differ in…
Training and deploying deep learning models in real-world applications require processing large amounts of data. This is a challenging task when the amount of data grows to a hundred terabytes, or even, petabyte-scale. We introduce a hybrid…
General Purpose Graphics Processing Unit (GPGPU) computing plays a transformative role in deep learning and machine learning by leveraging the computational advantages of parallel processing. Through the power of Compute Unified Device…
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on…
The scaling of large language models (LLMs) is currently bottlenecked by the rigidity of distributed programming. While high-performance libraries like CuBLAS and NCCL provide optimized primitives, they lack the flexibility required for…
Recent advances in computing architectures and networking are bringing parallel computing systems to the masses so increasing the number of potential users of these kinds of systems. In particular, two important technological evolutions are…
Multicore parallel programming has some very difficult problems such as deadlocks during synchronizations and race conditions brought by concurrency. Added to the difficulty is the lack of a simple, well-accepted computing model for…
This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of…
In the fusion community, the use of high performance computing (HPC) has been mostly dominated by heavy-duty plasma simulations, such as those based on particle-in-cell and gyrokinetic codes. However, there has been a growing interest in…
Deploying deep learning models in cloud clusters provides efficient and prompt inference services to accommodate the widespread application of deep learning. These clusters are usually equipped with host CPUs and accelerators with distinct…
CPU-GPU heterogeneous architectures are now commonly used in a wide variety of computing systems from mobile devices to supercomputers. Maximizing the throughput for multi-programmed workloads on such systems is indispensable as one single…
To effectively control large-scale distributed systems online, model predictive control (MPC) has to swiftly solve the underlying high-dimensional optimization. There are multiple techniques applied to accelerate the solving process in the…
GPUs have limited memory and it is difficult to train wide and/or deep models that cause the training process to go out of memory. It is shown in this paper how an open source tool called Large Model Support (LMS) can utilize a high…
It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However,…
Diffusion models have achieved great success in synthesizing high-quality images. However, generating high-resolution images with diffusion models is still challenging due to the enormous computational costs, resulting in a prohibitive…
Galvatron is a distributed system for efficiently training large-scale Foundation Models. It overcomes the complexities of selecting optimal parallelism strategies by automatically identifying the most efficient hybrid strategy,…
We present a technique designed for parallelizing large rigid body simulations, capable of exploiting multiple CPU cores within a computer and across a network. Our approach can be applied to simulate both unilateral and bilateral…