Distributed, Parallel, and Cluster Computing

CHAMB-GA: A Containerized HPC Scalable Microservice-Based Framework for Genetic Algorithms

Metaheuristic-based global optimization with embedded, long-running simulations is a computationally expensive process. To support various stages of development and execution, a seamless transition from personal computers to distributed…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-06-25 Felix Bonhoff, Thiemo Pesch, Andrea Benigni, Alexander Mitsos, Manuel Dahmen

DMuon: Efficient Distributed Muon Training with Near-Adam Overhead

Matrix-orthogonalization-based optimizers, exemplified by Muon, have demonstrated strong convergence behavior across a wide range of modern deep learning workloads. The matrix-aware updates offer a compelling alternative to conventional…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-06-25 Vincent Chen, Starrick Liu, Regis Cheng, Dance Yang, Shalfun Li, Ryan Yu, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang

RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning

Large language model (LLM) post-training for reasoning increasingly relies on reinforcement learning with verifiable rewards (RLVR), where models learn from ground-truth feedback on mathematical, logical, and scientific tasks. To enable…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-06-25 Rongjian Chen, Jianmin Hu, Kejiang Ye, Minxian Xu

Simulating Unified Tensor Resharding in heterogeneous AI systems

State-of-the-art AI training simulators assume homogeneous compute and network infrastructure. However, real-world training infrastructure is becoming increasingly heterogeneous since: (a) Model architectures such as multimodal and MoE…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-06-25 Sumit Kumar, Sayantan Dasgupta, Kushal Mitra, Meet Dadhania, Rohan Sudhir Basugade, Praveen Tammana, Satananda Burla, Abed Mohammad Kamaluddin, Rinku Shah

Moebius: Serving Mixture-of-Expert Models with Seamless Runtime Parallelism Switch

Mixture-of-Experts (MoE) architectures scale large language models (LLMs) to hundreds of billions of parameters. Serving a single MoE model requires multiple GPUs operating in parallel, typically through tensor parallelism (TP) or expert…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-06-25 Shaoyu Wang, Yizhuo Liang, Jaeyong Song, Chong Li, Seo Jin Park

Priceless: An examination of Serverless Functions-as-a-Service (FaaS) pricing models

Serverless Functions-as-a-Service providers have grown in their offering since inception a decade ago, with a myriad of new functionalities offered to end-users. These new features have also brought new, varied and at times complex pricing…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-06-24 Nnamdi Ekwe-Ekwe

A Distributed Quantum Approximate Optimization Algorithm Simulator for Engineering Design Optimization

This paper presents a Qiskit-compatible distributed quantum approximate optimization algorithm (DQAOA) simulator for quadratic unconstrained binary optimization (QUBO) problems arising in engineering design and decision applications. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-06-24 Ali Rajabi, Milad Hasanzadeh, Amin Kargarian

FinWhale: An Optimally Resilient Two-Round Terminating DAG Protocol

DAG-based Byzantine Fault Tolerant protocols provide high-throughput consensus under partial synchrony, but existing DAG protocols still require at least three message delays to commit decisions. In contrast, fast-path Byzantine Fault…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-06-24 Razya Ladelsky, Roy Friedman

Hot AI in Cold Space: Thermal-Crosstalk-Aware Scheduling for Sustainable Orbital AI Clusters

Terrestrial AI training faces an unsustainable energy and water crisis, positioning Orbital Data Centers (ODCs) as a "zero operational carbon" alternative. However, the sub-$10\mu\text{s}$ communication latency required for distributed…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-06-23 Shuyi Chen, Zhengchang Hua, Nikos Tziritas, Georgios Theodoropoulos

RAFI -- A Ray/Work Forwarding Infrastructure for Data Parallel Multi-Node/Multi-GPU Computing

We present RaFI, a CUDA and MPI based software framework that simplifies the task of building GPU-enabled data-parallel software where rays or similar work items need to migrate between different GPUs. RaFI provides a simple interface for…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Ingo Wald, Serkan Demirci, Alper Sahistan, Stefan Zellmann, Andrea Paris, Patrick Moran, Milan Jaros, Tatiana von Landesberger, Ugur Gudukbay, Valerio Pascucci

Effective MPI: User-defined Datatypes and Cartesian Communicators for Zero-copy All-to-all Communication in Multidimensional Tori

We present and show how to implement a non-trivial all-to-all communication algorithm for arbitrary $d$-dimensional tori effectively in MPI. Given a factorization of the number of processes $p$ into $d$ factors that can be mapped onto a…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Jesper Larsson Träff

CARM Tool: Cache-Aware Roofline Model Automatic Benchmarking and Application Analysis

In recent years, HPC systems, and CPU architectures as their central components, have become increasingly complex, making application development and optimization quite challenging. In this respect, intuitive performance models like the…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 José Morgado, Leonel Sousa, Aleksandar Ilic

PRISM: Processing-In-Memory Sparse MTTKRP for Tensor Decomposition Acceleration

Sparse tensors are the most used representation of sparse multidimensional data. Operations that decompose them, selecting their most important features while reducing their dimension, have become prevalent procedures in machine learning.…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Daniel Pacheco, Leonel Sousa, Aleksandar Ilic

AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training

Pipeline parallelism is essential for large-scale model training, but existing asynchronous approaches often degrade convergence due to parameter mismatch between forward and backward passes. We propose Asynchronous Multi-Directional…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Ling Chen, Houming Wu, Wenjie Yu

TC-MIS: Maximal Independent Set on Tensor-cores

Maximal Independent Set (MIS) in a graph is a fundamental problem with applications in resource allocation, scheduling, and network optimization. Although graphs are inherently unstructured and challenging for GPU parallelism due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Prajjwal Nijhara, Dip Sankar Banerjee

Design and Implementation of a Serverless MapReduce Framework for Scalable Data Pipelines

Modern logistics systems tend to generate continuous streams of data from sources such as GPS, IoT sensors, and logistics management systems. The aggregation, processing, and analysis of data have become vital for monitoring operations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Angelos Dorotheos Chatzopoulos, Babis Andreou, Kakia Panagidi, Stathes Hadjiefthymiades

Silent Data Corruption Protection through Efficient Task Replication

The trend of increasing cluster sizes of supercomputers leads to a growing susceptibility to Silent Data Corruption (SDC) that can invalidate program results. A common strategy for SDC protection is replication, where the computation is…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Mia Reitz, Claudia Fohry

Understanding and Reducing Metadata-Driven Host Overheads in Sampling-Based GNN Training

Modern deep learning workloads increasingly exhibit dynamic, metadata-driven execution, where runtime-generated information determines memory provisioning and kernel launch decisions. In sampling-based graph neural network (GNN) training,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Yidong Gong, Saima Afrin, Yuchen Ma, Guannan Wang, Bin Ren, Pradeep Kumar

HPC-vQPU: A Service-Export Architecture for Virtual QPUs on Batch-Scheduled HPC Systems

Device-aware quantum simulation increasingly requires HPC-scale accelerators, yet secure supercomputers expose batch-scheduled execution environments rather than the interactive, backend-oriented interfaces expected by quantum software. The…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Shusen Liu, Pascal Jahan Elahi, Ugo Varetto

Monte Cimone v3: Where RISC-V Stands in High-Performance Computing

The Monte Cimone project provides a RISC-V testbed for High-Performance Computing clusters. This paper presents Monte Cimone v3 (MCv3), the third iteration of the Monte Cimone RISC-V HPC cluster, integrating the SOPHGO Sophon SG2044…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-29 Emanuele Venieri, Simone Manoni, Giacomo Madella, Federico Proverbio, Federico Ficarelli, Luca Benini, Andrea Bartolini