Distributed, Parallel, and Cluster Computing

Context-aware Simopt-Power: Using structural data with simulation metadata to optimise FPGA designs

Pre-implementation behavioural simulation routinely validates functional correctness, yet it also produces rich switching-activity traces that are typically discarded by FPGA computer-aided design (CAD) flows. Prior simulation-guided and…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-28 Eashan Wadhwa, Georgios Floros, Shanker Shreejith

A Formal Semantics of C with OpenMP Parallelism (Extended Version)

OpenMP is a popular parallelization framework that lets users transform sequential code into parallel code with a few simple annotations. Unfortunately, it is also easy to inadvertently introduce errors by adding OpenMP pragmas into…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-28 Ke Du, Anshu Sharma, Liyi Li, William Mansky

Orbax: Distributed Checkpointing with JAX

In a landscape of high-performance distributed ML systems, JAX has emerged as a framework of choice. However, JAX's modular design philosophy leaves it without a standardized checkpointing solution. In this paper, we introduce Orbax, a…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-28 Colin Gaffney, Shutong Li, Daniel Ng, Anastasia Petrushkina, Niket Kumar, Adam Cogdell, Mridul Sahu, Yaning Liang, Nikhil Bansal, Justin Pan, Angel Mau, Abhishek Agrawal, Marco Berlot, Ruoxin Sang, Kiranbir Sodhia, Rakesh Iyer

Ding-Dong Ditch: Peeking Into Spot Instance Availability

Spot instances offer significant cost savings of up to 90% over on-demand prices, making them an attractive resource for large-scale computing workloads. However, understanding their availability dynamics is essential for building systems…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-28 Kyumin Kim, Moohyun Song, Taeyoon Kim, Kyungyong Lee

Continuous benchmarking: Keeping pace with an evolving ecosystem of models and technologies

Drawing on ideas from continuous integration, we present concepts of an automated benchmarking pipeline for high performance applications. Customization and collaboration have been key design goals owing to the requirements of…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-28 Jan Vogelsang, Melissa Lober, Catherine Mia Schöfmann, José Villamar, Dennis Terhorst, Johanna Senk, Hans Ekkehard Plesser, Markus Diesmann, Susanne Kunkel, Anno C. Kurth

Accelerating discovery across scientific disciplines through reproducible workflows with AiiDAlab

With ever-increasing computational capabilities, robust and automated research workflows have become essential for orchestrating large numbers of interdependent simulations. However, significant technical expertise is still required to…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-28 Aliaksandr V. Yakutovich, Daniel Hollas, Edan Bainglass, Jusong Yu, Corsin Battaglia, Miki Bonacci, Lucas Fernandez Vilanova, Stephan Henne, Anders Kaestner, Michel Kenzelmann, Graham Kimbell, Jakob Lass, Fabio Lopes, Daniel G. Mazzone, Andres Ortega-Guerrero, Xing Wang, Nicola Marzari, Carlo A. Pignedoli, Giovanni Pizzi

Autonomic Federated-Market Orchestration for the Edge-Cloud Continuum

The edge-cloud computing continuum demands self-management mechanisms that scale across autonomous administrative domains while honouring tenant- and operator-specified data sovereignty. We present Neural Pub/Sub, a federated-broker…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Lauri Lovén, Roberto Morabito, Abhishek Kumar, Susanna Pirttikangas, Jukka Riekki, Sasu Tarkoma

Nonlinear spectral clustering with C++ GraphBLAS

Nonlinear reformulations of the spectral clustering method have gained a lot of recent attention due to their increased numerical benefits and their solid mathematical background. However, the estimation of the multiple nonlinear…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Dimosthenis Pasadakis, Olaf Schenk, Verner Vlacic, Albert-Jan Yzelman

Revisiting Bruck: Phase-Efficient All-to-All Communication in Reconfigurable Networks

All-to-All communication is a key performance bottleneck for distributed machine learning (ML) and high-performance computing (HPC) workloads, where dense traffic increasingly stresses scale-up interconnects. While these ML and HPC…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Anton Juerss, Stefan Schmid

Reducing Internal State in Eigenvalue-Only Divide-and-Conquer Tridiagonal Eigensolvers

Divide and Conquer (D&C) is a widely used algorithmic strategy for symmetric eigenvalue decomposition. Its natural parallelism makes D&C attractive on modern multicore CPUs and GPUs, but existing eigenvalue-only routines often default to…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Ruiyi Zhan, Shaoshuai Zhang

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

Large-batch Contrastive Learning (CL), the foundation of modern representation learning, is fundamentally incompatible with the volatile resource constraints of edge devices. This conflict creates a dilemma: small on-device batches degrade…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Minh K. Quan, Pubudu N. Pathirana

Characterization-Guided GPU Fault Resilience in NVIDIA MPS

NVIDIA Multi-Process Service (MPS) enables fine-grained GPU sharing by allowing multiple processes to execute concurrently on the same GPU, making it an important mechanism for improving GPU utilization. However, MPS has weak fault…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Rixin Liu, Xingqi Cui, Kaijian Wang, Xinheng Ding, Zirui Liu, Yuke Wang, Jiarong Xing

Configuration-Driven Dynamic API Routing for Resilient Service Integrations

Modern online services rely on third-party APIs for authentication, payments, communication, identity verification, fraud detection, observability, and fulfillment. These dependencies are outside the direct operational control of the…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Nataraj Agaram Sundar, Tejas Morabia

GridPilot: Real-Time Grid-Responsive Control for AI Supercomputers

At global scale, data-center electricity demand is growing faster than the grids that supply it, while system operators increasingly require large flexible loads that can adjust power within seconds to absorb variable wind and solar…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Denisa-Andreea Constantinescu, David Atienza

Totoro$^+$: An Adaptive and Scalable Edge Federated Learning System

Federated Learning (FL) is an emerging distributed machine learning (ML) technique that enables in-situ model training and inference on decentralized edge devices. We propose Totoro$^+$, a novel scalable FL system that enables massive FL…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Cheng-Wei Ching, Xin Chen, Taehwan Kim, Jian-Jhih Kuo, Dilma Da Silva, Liting Hu

Agentic AI Workload Characteristics

Agentic AI shifts LLM serving from isolated prompt-generation requests to stateful, multi-turn executions that repeatedly invoke the model, call tools, and grow context over time. This paper characterizes ReAct-style agents from both the…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Yichao Yuan, Ankita Nayak, Souvik Kundu, Nishil Talati

Semantic-aware Token Selection and Resource Optimization for Communication-efficient Split Federated Fine-tuning in Edge Intelligence

Deploying large Transformer-based vision models on resource-limited mobile devices at network edge is severely constrained by hardware limitations and dynamic wireless environments. While federated learning (FL) enables collaborative…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Xianke Qiang, Zheng Chang, Geyong Min

Edge AI Deployment Beyond Models: A BSP-Aware Systems Framework for Industrial Embedded Platforms

Industrial Edge AI programs often begin with the model and only later confront the platform. That sequencing is attractive because it allows early demonstrations, but it breaks down when the deployment target is an embedded system with long…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Pitchai Muthu M

Xe-Forge: Multi-Stage LLM-Powered Kernel Optimization for Intel GPU

Porting deep learning algorithms to new hardware accelerators requires developers to repeatedly apply the same low-level optimizations -- quantization, memory access coalescing, tile size tuning, and architecture-specific workarounds -- to…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Marcin Spoczynski, Daniel Fleischer, Moshe Berchansky, Gabriela Ben-Melech Stan, Shira Guskin, Weilin Xu, Adam Siemieniuk, Alexander Heinecke

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

Large-scale AI training is now fundamentally a distributed systems problem, and hardware failures have become routine operating conditions rather than rare exceptions. Public operational evidence from production training clusters, however,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-27 Daemyung Kang, Eunjin Hwang, Hanjeong Lee, HyeokJin Kim, Hyunhoi Koo, Jeongkyu Shin, Jeongseok Kang, Jihyun Kang, Joongi Kim, Junbum Lee, Jungseung Yang, Kyujin Cho, Youngsook Song