Related papers: Efficient and Eventually Consistent Collective Ope…

Efficient All-to-All Collective Communication Schedules for Direct-Connect Topologies

The all-to-all collective communications primitive is widely used in machine learning (ML) and high performance computing (HPC) workloads, and optimizing its performance is of interest to both ML and HPC communities. All-to-all is a…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-29 Prithwish Basu, Liangyu Zhao, Jason Fantl, Siddharth Pal, Arvind Krishnamurthy, Joud Khoury

MPI Collectives for Multi-core Clusters: Optimized Performance of the Hybrid MPI+MPI Parallel Codes

The advent of multi-/many-core processors in clusters advocates hybrid parallel programming, which combines Message Passing Interface (MPI) for inter-node parallelism with a shared memory model for on-node parallelism. Compared to the…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-15 Huan Zhou, Jose Gracia, Ralf Schneider

Making Applications Faster by Asynchronous Execution: Slowing Down Processes or Relaxing MPI Collectives

Comprehending the performance bottlenecks at the core of the intricate hardware-software interactions exhibited by highly parallel programs on HPC clusters is crucial. This paper sheds light on the issue of automatically asynchronous MPI…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-06 Ayesha Afzal, Georg Hager, Stefano Markidis, Gerhard Wellein

Collective Communication Profiling of Modern-day Machine Learning Workloads

Machine Learning jobs, carried out on a large number of distributed high performance systems, involve periodic communication using operations like AllReduce, AllGather, and Broadcast. These operations may create high bandwidth and bursty…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-11 Jit Gupta, Andrew Li, Tarun Banka, Ariel Cohen, T. Sridhar, Raj Yavatkar

Collectives in hybrid MPI+MPI code: design, practice and performance

The use of a hybrid scheme combining the message passing programming models for inter-node parallelism and the shared memory programming models for node-level parallelism is widespread. Existing extensive practices on hybrid Message…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-23 Huan Zhou, Jose Gracia, Naweiluo Zhou, Ralf Schneider

SparCML: High-Performance Sparse Communication for Machine Learning

Applying machine learning techniques to the quickly growing data in science and industry requires highly-scalable algorithms. Large datasets are most commonly processed "data parallel" distributed across many nodes. Each node's contribution…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-19 Cedric Renggli, Saleh Ashkboos, Mehdi Aghagolzadeh, Dan Alistarh, Torsten Hoefler

Accelerating MPI Collectives with Process-in-Process-based Multi-object Techniques

In the exascale computing era, optimizing MPI collective performance in high-performance computing (HPC) applications is critical. Current algorithms face performance degradation due to system call overhead, page faults, or data-copy…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-19 Jiajun Huang, Kaiming Ouyang, Yujia Zhai, Jinyang Liu, Min Si, Ken Raffenetti, Hui Zhou, Atsushi Hori, Zizhong Chen, Yanfei Guo, Rajeev Thakur

A Survey of Potential MPI Complex Collectives: Large-Scale Mining and Analysis of HPC Applications

Offload of MPI collectives to network devices, e.g., NICs and switches, is being implemented as an effective mechanism to improve application performance by reducing inter- and intra-node communication and bypassing MPI software layers.…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-01 Pouya Haghi, Ryan Marshall, Po Hao Chen, Anthony Skjellum, Martin Herbordt

Optimal, Non-pipelined Reduce-scatter and Allreduce Algorithms

The reduce-scatter collective operation in which $p$ processors in a network of processors collectively reduce $p$ input vectors into a result vector that is partitioned over the processors is important both in its own right and as building…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-14 Jesper Larsson Träff

LLMapReduce: Multi-Level Map-Reduce for High Performance Data Analysis

The map-reduce parallel programming model has become extremely popular in the big data community. Many big data workloads can benefit from the enhanced performance offered by supercomputers. LLMapReduce provides the familiar map-reduce…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-12-13 Chansup Byun, Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Vijay Gadepally, Matthew Hubbell, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Albert Reuther

PICO: Performance Insights for Collective Operations

Collective operations are cornerstones of both HPC applications and large-scale AI training and inference, yet benchmarking them in a systematic and reproducible way remains difficult on modern systems due to the complexity of their…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-14 Saverio Pasqualoni, Tommaso Bonato, Lorenzo Piarulli, Torsten Hoefler, Marco Canini, Daniele De Sensi

ZCCL: Significantly Improving Collective Communication With Error-Bounded Lossy Compression

With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communication turns out to be a critical bottleneck in large-scale distributed and parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-02-27 Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao, Khalid Alharthi, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

GC3: An Optimizing Compiler for GPU Collective Communication

Machine learning models made up of millions or billions of parameters are trained and served on large multi-GPU systems. As models grow in size and execute on more GPUs, the collective communications used in these applications become a…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-21 Meghan Cowan, Saeed Maleki, Madanlal Musuvathi, Olli Saarikivi, Yifan Xiong

Optimizing Allreduce Operations for Modern Heterogeneous Architectures with Multiple Processes per GPU

Large inter-GPU all-reduce operations, prevalent throughout deep learning, are bottlenecked by communication costs. Emerging heterogeneous architectures are comprised of complex nodes, often containing $4$ GPUs and dozens to hundreds of CPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-26 Michael Adams, Amanda Bienz

Optimizing Distributed ML Communication with Fused Computation-Collective Operations

In order to satisfy their ever increasing capacity and compute requirements, machine learning models are distributed across multiple nodes using numerous parallelism strategies. As a result, collective communications are often on the…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-24 Kishore Punniyamurthy, Khaled Hamidouche, Bradford M. Beckmann

Accurate runtime selection of optimal MPI collective algorithms using analytical performance modelling

The performance of collective operations has been a critical issue since the advent of MPI. Many algorithms have been proposed for each MPI collective operation but none of them proved optimal in all situations. Different algorithms…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-04-24 Emin Nuriyev, Alexey Lastovetsky

Efficient Communications in Training Large Scale Neural Networks

We consider the problem of how to reduce the cost of communication that is required for the parallel training of a neural network. The state-of-the-art method, Bulk Synchronous Parallel Stochastic Gradient Descent (BSP-SGD), requires many…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-18 Linnan Wang, Wei Wu, George Bosilca, Richard Vuduc, Zenglin Xu

High performance scheduling of mixed-mode DAGs on heterogeneous multicores

Many HPC applications can be expressed as mixed-mode computations, in which each node of a computational DAG is itself a parallel computation that can be molded at runtime to allocate different amounts of processing resources. At the same…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-07-10 Agnes Rohlin, Henrik Fahlgren, Miquel Pericas

An Optimized Error-controlled MPI Collective Framework Integrated with Lossy Compression

With the ever-increasing computing power of supercomputers and the growing scale of scientific applications, the efficiency of MPI collective communications turns out to be a critical bottleneck in large-scale distributed and parallel…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-19 Jiajun Huang, Sheng Di, Xiaodong Yu, Yujia Zhai, Zhaorui Zhang, Jinyang Liu, Xiaoyi Lu, Ken Raffenetti, Hui Zhou, Kai Zhao, Zizhong Chen, Franck Cappello, Yanfei Guo, Rajeev Thakur

Flare: Flexible In-Network Allreduce

The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches,…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-10-28 Daniele De Sensi, Salvatore Di Girolamo, Saleh Ashkboos, Shigang Li, Torsten Hoefler