分布式、并行与集群计算
As high-performance computing systems scale in size and complexity, efficient resource management is essential to minimize communication overhead. The HyperX is a richly connected, low-diameter network that offers a scalable and…
The rapid adoption of large language models (LLMs) has shifted a substantial portion of inference workloads into throughput-oriented offline regimes, where fully utilizing GPU compute requires large batch sizes. However, existing…
Multimodal LLM datasets are inherently heterogeneous, with significant data variability. Although each modality exhibits independent variability, sample-level entanglement makes it difficult to balance workloads across both modalities and…
Neighbor graphs capture relationships among data points and are widely used in data analytics and AI workloads. Many studies have explored approximate construction methods for single-node systems, including GPUs. However, extending this to…
As datacenters continue to grow in scale, their energy consumption and resulting carbon footprint have become pressing concerns. With the increasing share of renewable energy in a datacenter's mixed energy supply, shifting task execution to…
Estimating CPU power on heterogeneous ARM-based commodity devices is challenging due to limited access to CPU's voltage domains. As a result, state-of-the-art energy-aware Federated Learning (FL) frameworks typically rely on simplified…
Pre-implementation behavioural simulation routinely validates functional correctness, yet it also produces rich switching-activity traces that are typically discarded by FPGA computer-aided design (CAD) flows. Prior simulation-guided and…
OpenMP is a popular parallelization framework that lets users transform sequential code into parallel code with a few simple annotations. Unfortunately, it is also easy to inadvertently introduce errors by adding OpenMP pragmas into…
In a landscape of high-performance distributed ML systems, JAX has emerged as a framework of choice. However, JAX's modular design philosophy leaves it without a standardized checkpointing solution. In this paper, we introduce Orbax, a…
Spot instances offer significant cost savings of up to 90% over on-demand prices, making them an attractive resource for large-scale computing workloads. However, understanding their availability dynamics is essential for building systems…
Drawing on ideas from continuous integration, we present concepts of an automated benchmarking pipeline for high performance applications. Customization and collaboration have been key design goals owing to the requirements of…
With ever-increasing computational capabilities, robust and automated research workflows have become essential for orchestrating large numbers of interdependent simulations. However, significant technical expertise is still required to…
The edge-cloud computing continuum demands self-management mechanisms that scale across autonomous administrative domains while honouring tenant- and operator-specified data sovereignty. We present Neural Pub/Sub, a federated-broker…
Nonlinear reformulations of the spectral clustering method have gained a lot of recent attention due to their increased numerical benefits and their solid mathematical background. However, the estimation of the multiple nonlinear…
All-to-All communication is a key performance bottleneck for distributed machine learning (ML) and high-performance computing (HPC) workloads, where dense traffic increasingly stresses scale-up interconnects. While these ML and HPC…
Divide and Conquer (D&C) is a widely used algorithmic strategy for symmetric eigenvalue decomposition. Its natural parallelism makes D&C attractive on modern multicore CPUs and GPUs, but existing eigenvalue-only routines often default to…
Large-batch Contrastive Learning (CL), the foundation of modern representation learning, is fundamentally incompatible with the volatile resource constraints of edge devices. This conflict creates a dilemma: small on-device batches degrade…
NVIDIA Multi-Process Service (MPS) enables fine-grained GPU sharing by allowing multiple processes to execute concurrently on the same GPU, making it an important mechanism for improving GPU utilization. However, MPS has weak fault…
Modern online services rely on third-party APIs for authentication, payments, communication, identity verification, fraud detection, observability, and fulfillment. These dependencies are outside the direct operational control of the…
At global scale, data-center electricity demand is growing faster than the grids that supply it, while system operators increasingly require large flexible loads that can adjust power within seconds to absorb variable wind and solar…