Related papers: Large Scale Parallelization Using File-Based Commu…
HPC systems keep growing in size to meet the ever-increasing demand for performance and computational resources. Apart from increased performance, large scale systems face two challenges that hinder further growth: energy efficiency and…
The simulation of large ensembles of particles is usually parallelized by partitioning the domain spatially and using message passing to communicate between the processes handling neighboring subdomains. The particles are represented as…
MPI implementations commonly rely on explicit memory-copy operations, incurring overhead from redundant data movement and buffer management. This overhead notably impacts HPC workloads involving intensive inter-processor communication. In…
Recent increased interest in Cloud computing emphasizes the need to find an adequate solution to the load-balancing problem in parallel computing -- efficiently running several jobs concurrently on a cluster of shared computers (nodes). One…
Running parallel applications requires special and expensive processing resources to obtain the required results within a reasonable time. Before parallelizing serial applications, some analysis is recommended to decide…
The number of processing cores available within the computing nodes of recent clustered environments is growing rapidly. Despite this trend, the number of available network interfaces in such computing…
In this work, a new parallel dual-grid multiscale approach for CFD-DEM couplings is investigated. Dual-grid multiscale CFD-DEM couplings have been recently developed and successfully adopted in different applications; still, an efficient…
The increasing parallelism of many-core systems demands efficient strategies for run-time system management. Due to the large number of cores, the management overhead has a rising impact on overall system performance. This work…
Sequential computation is well understood but does not scale well with current technology. Within the next decade, systems will contain large numbers of processors with potentially thousands of processors per chip. Despite this, many…
Parallel algorithms relying on synchronous parallelization libraries often experience adverse performance due to global synchronization barriers. Asynchronous many-task runtimes offer task futurization capabilities that minimize or remove…
In the realm of Large Language Model (LLM) inference, the inherent structure of transformer models coupled with the multi-GPU tensor parallelism strategy leads to a sequential execution of computation and communication. This results in…
Parallel architectures are continually increasing in performance and scale, while underlying algorithmic infrastructure often fails to take full advantage of available compute power. Within the context of MPI, irregular communication…
Developing an efficient server-based real-time scheduling solution that supports dynamic task-level parallelism is now relevant to even the desktop and embedded domains and no longer only to the high performance computing market niche. This…
Parallel and distributed processing is becoming the de facto industry standard, and a large part of current research targets how to make computing scalable and distributed, dynamically, without allocating the resources on…
We consider a large-scale parallel-server system, where each server independently adjusts its processing speed in a decentralized manner. The objective is to minimize the overall cost, which comprises the average cost of maintaining the…
Next-generation wireless technologies (for immersive-massive communication, joint communication and sensing) demand highly parallel architectures for massive data processing. A common architectural template scales up by grouping tens to…
The current trend of multicore architectures on shared memory systems underscores the need for parallelism. While there are several programming models to express parallelism, the thread programming model has become a standard to support these system…
Compute nodes on modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects (NICs). Parallelization is identified as a technique for effectively utilizing these systems to execute scalable…
As multimodal and AI-driven services exchange hundreds of megabytes per request, existing IPC runtimes spend a growing share of CPU cycles on memory copies. Although both hardware and software mechanisms are exploring memory offloading,…
Many modern applications require real-time processing of large volumes of high-speed data. Such data processing needs can be modeled as a streaming computation. A streaming computation is specified as a dataflow graph that exposes multiple…