Distributed, Parallel, and Cluster Computing

FBench: A Flexible Benchmark for CFG-Based What-If Exploration of HPC I/O Patterns

The I/O performance of large-scale HPC applications depends on a complex interplay of access patterns, middleware optimizations, and file system configurations. To systematically explore these effects without repeatedly rerunning full…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-29 · Zhaobin Zhu, Chen Wang, Kathryn Mohror, Sarah Neuwirth

Beyond Uniform Experts: Cost-Aware Expert Execution for Efficient Multi-Device MoE Inference

Mixture-of-Experts (MoE) architectures enable language models to achieve unprecedented scale via sparse activation. However, their inference performance is often limited by data movement bottlenecks. Two coupled challenges exacerbate this…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-29 · Hui Zang, Pengfei Xia, Hong Liu, Jiajia Chu, Tuo Hao, Minghao Chen, Rui Zhang, Ziyang Zhang

SMART-MIG: A Learning Framework for Scalable and Energy-Efficient GPU Scheduling

The emergence of Multi-Instance GPU (MIG) technology enables us to run smaller machine learning models on partitions of a GPU rather than the entire device, thus improving utilization and reducing energy consumption, albeit with potential…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-29 · Wenqing Yu, Neel Karia, Tanvi Hisaria, Clifford Stein, Olivier Tardieu, Asser Tantawi

Demystifying the Design Space and Best Practices for Heterogeneous LLM Inference and Serving

Heterogeneous prefill-decode (PD) inference is now in production: prefill on cost-efficient or supply-available accelerators, decode on bandwidth-strong ones, and KV state crossing mixed interconnects in mixed numerical formats. Each…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-29 · Zhixin Wang, Zhengbo Wang, Fangcheng Fu, Yinhui Lu, Jinlong Hou, Yijie Chen, Xiaowei Shen, He Liu, Xiangbin Li, Jun Chen, Ruya Gu, Dian Wang, Zhou Tan, Yuan Cheng, Hongzhou Zhang, Xiangjun Huang, Ping Zhang, Xiaohe Hu

NI-ORCA: A Parallel Algorithm for Counting the Orbits of Non-Induced Graphlets up to K4

Counting the orbits of graphlets in a network is a vital tool for understanding the structural roles of vertices in various graph analytics tasks. While existing algorithms efficiently compute orbits of induced graphlets, many real-world…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-28 · Syed Ibtisam Tauhidi, Arindam Karmakar, Thai Son Mai, Hans Vandierendonck

Energy-Efficient Multimodal Inference Serving with Tri-serve

Multimodal model inference creates substantial energy demand with growing performance requirements. Within GPUs, power is autonomously managed by an on-board power management unit (PMU), which makes frequency boosting/throttling decisions.…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-28 · Ziyang Jia, Sara Rashidi Golrouye, Laxmi Bhuyan, Benjamin Kubwimana, Devashree Tripathy, Zexin Li, Cong Liu, Daniel Wong

Fog Computing and Large Language Models: A vision for the mutual beneficiaries

Fog computing utilizes proximal computational resources for sensor data processing and actuation, and addresses the latency, network load, and privacy issues of cloud-centric Internet of Things. On the other hand, Large Language Models…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-28 · Satish Narayana Srirama

KernelFlume: Elastic Core-Attention Scaling for Agentic Long-Context Decoding

LLM serving is increasingly dominated by long and dynamic decode workloads from agents, reasoning models, and extended conversations. When bursty long-context demand exceeds deployed capacity, existing serving systems typically scale out by…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-28 · Guangyu Xiang, Xueze Kang, Lin Zhang, Wenxiang Lin, Shaohuai Shi, Yuxin Wang, Xiaowen Chu

Are There Manufacturer Differences in Hard-Drive Reliability?

Based on the Backblaze hard disk drive (HDD) dataset, we analyze whether the four major HDD manufacturers represented in the dataset -- HGST, Seagate, Toshiba, Western Digital (WD) -- show differences in short- to medium-term HDD failure…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-27 · Christoph Siemroth, Yeomyung Park

Importance-Aware Resource Allocation for Collaborative Task-Oriented Semantic Communication

Task-oriented semantic communication must allocate scarce radio resources to semantic features under fast fading wireless conditions and strict end-to-end latency budgets. Existing solutions are either optimization-heavy, leading to…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-27 · Kaiyi Lei, Yuanzhe Peng, Letian Zhang, Jie Xu

Five Ways to Build a Concurrent Linked List: From Coarse-Grain Locking to Lock-Free Algorithms

Linked lists are one of the most basic data structures in computer science. But when many threads try to use the same linked list at the same time, things get complicated. In this paper, we look at five different ways to make a linked list…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-27 · Zeeshan Mohammed Rangrej

Concurrent Splay-Based Tree

Most work on efficient concurrent ordered indices, such as concurrent binary search trees, B-trees, skip lists, etc., has focused on data structures that provide good worst-case guarantees. In real workloads, objects are often…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-27 · Vitaly Aksenov, Rene van Bevern, Artem Shilkin

CHAMB-GA: A Containerized HPC Scalable Microservice-Based Framework for Genetic Algorithms

Metaheuristic-based global optimization with embedded, long-running simulations is a computationally expensive process. To support various stages of development and execution, a seamless transition from personal computers to distributed…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-25 · Felix Bonhoff, Thiemo Pesch, Andrea Benigni, Alexander Mitsos, Manuel Dahmen

DMuon: Efficient Distributed Muon Training with Near-Adam Overhead

Matrix-orthogonalization-based optimizers, exemplified by Muon, have demonstrated strong convergence behavior across a wide range of modern deep learning workloads. The matrix-aware updates offer a compelling alternative to conventional…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-25 · Vincent Chen, Starrick Liu, Regis Cheng, Dance Yang, Shalfun Li, Ryan Yu, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang

RolloutPipe: Overlapping Pipelined Rollout and Training in Disaggregated On-Policy LLM Reinforcement Learning

Large language model (LLM) post-training for reasoning increasingly relies on reinforcement learning with verifiable rewards (RLVR), where models learn from ground-truth feedback on mathematical, logical, and scientific tasks. To enable…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-25 · Rongjian Chen, Jianmin Hu, Kejiang Ye, Minxian Xu

Simulating Unified Tensor Resharding in heterogeneous AI systems

State-of-the-art AI training simulators assume homogeneous compute and network infrastructure. However, real-world training infrastructure is becoming increasingly heterogeneous since: (a) Model architectures such as multimodal and MoE…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-25 · Sumit Kumar, Sayantan Dasgupta, Kushal Mitra, Meet Dadhania, Rohan Sudhir Basugade, Praveen Tammana, Satananda Burla, Abed Mohammad Kamaluddin, Rinku Shah

Moebius: Serving Mixture-of-Expert Models with Seamless Runtime Parallelism Switch

Mixture-of-Experts (MoE) architectures scale large language models (LLMs) to hundreds of billions of parameters. Serving a single MoE model requires multiple GPUs operating in parallel, typically through tensor parallelism (TP) or expert…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-25 · Shaoyu Wang, Yizhuo Liang, Jaeyong Song, Chong Li, Seo Jin Park

Priceless: An examination of Serverless Functions-as-a-Service (FaaS) pricing models

Serverless Functions-as-a-Service providers have expanded their offerings since their inception a decade ago, with a myriad of new functionalities offered to end-users. These new features have also brought new, varied and at times complex pricing…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-24 · Nnamdi Ekwe-Ekwe

A Distributed Quantum Approximate Optimization Algorithm Simulator for Engineering Design Optimization

This paper presents a Qiskit-compatible distributed quantum approximate optimization algorithm (DQAOA) simulator for quadratic unconstrained binary optimization (QUBO) problems arising in engineering design and decision applications. The…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-24 · Ali Rajabi, Milad Hasanzadeh, Amin Kargarian

FinWhale: An Optimally Resilient Two-Round Terminating DAG Protocol

DAG-based Byzantine Fault Tolerant protocols provide high-throughput consensus under partial synchrony, but existing DAG protocols still require at least three message delays to commit decisions. In contrast, fast-path Byzantine Fault…

Distributed, Parallel, and Cluster Computing · Computer Science · 2026-06-24 · Razya Ladelsky, Roy Friedman