Related papers: Rethinking Inter-Process Communication with Memory…

Offloading Artificial Intelligence Workloads across the Computing Continuum by means of Active Storage Systems

The increasing demand for artificial intelligence (AI) workloads across diverse computing environments has driven the need for more efficient data management strategies. Traditional cloud-based architectures struggle to handle the sheer…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-03 Alex Barceló, Sebastián A. Cajas Ordoñez, Jaydeep Samanta, Andrés L. Suárez-Cetrulo, Romila Ghosh, Ricardo Simón Carbajo, Anna Queralt

Taming Offload Overheads in a Massively Parallel Open-Source RISC-V MPSoC: Analysis and Optimization

Heterogeneous multi-core architectures combine on a single chip a few large, general-purpose host cores, optimized for single-thread performance, with (many) clusters of small, specialized, energy-efficient accelerator cores for…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-12 Luca Colagrande, Luca Benini

Cache-Conscious Run-time Decomposition of Data Parallel Computations

Multi-core architectures feature an intricate hierarchy of cache memories, with multiple levels and sizes. To adequately decompose an application according to the traits of a particular memory hierarchy is a cumbersome task that may be…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-20 Hervé Paulino, Nuno Delgado

Periodic I/O scheduling for super-computers

With the ever-growing need of data in HPC applications, the congestion at the I/O level becomes critical in super-computers. Architectural enhancement such as burst-buffers and pre-fetching are added to machines, but are not sufficient to…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-02-23 Guillaume Aupy, Ana Gainaru, Valentin Le Fèvre

Optimizing Offload Performance in Heterogeneous MPSoCs

Heterogeneous multi-core architectures combine a few "host" cores, optimized for single-thread performance, with many small energy-efficient "accelerator" cores for data-parallel processing, on a single chip. Offloading a computation to the…

Hardware Architecture · Computer Science 2025-11-11 Luca Colagrande, Luca Benini

Techniques for Shared Resource Management in Systems with Throughput Processors

The continued growth of the computational capability of throughput processors has made throughput processors the platform of choice for a wide variety of high performance computing applications. Graphics Processing Units (GPUs) are a prime…

Hardware Architecture · Computer Science 2018-05-01 Rachata Ausavarungnirun

Exploring Fully Offloaded GPU Stream-Aware Message Passing

Modern heterogeneous supercomputing systems are comprised of CPUs, GPUs, and high-speed network interconnects. Communication libraries supporting efficient data transfers involving memory buffers from the GPU memory typically require the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-29 Naveen Namashivayam, Krishna Kandalla, James B White, Larry Kaplan, Mark Pagel

Decoupled Strategy for Imbalanced Workloads in MapReduce Frameworks

In this work, we consider the integration of MPI one-sided communication and non-blocking I/O in HPC-centric MapReduce frameworks. Using a decoupled strategy, we aim to overlap the Map and Reduce phases of the algorithm by allowing…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-10-10 Sergio Rivas-Gomez, Sai Narasimhamurthy, Keeran Brabazon, Oliver Perks, Erwin Laure, Stefano Markidis

Advancements in Traffic Processing Using Programmable Hardware Flow Offload

The exponential growth of data traffic and the increasing complexity of networked applications demand effective solutions capable of passively inspecting and analysing the network traffic for monitoring and security purposes. Implementing…

Networking and Internet Architecture · Computer Science 2024-07-24 Luca Deri, Alfredo Cardigliano, Francesco Fusco

PUL: Pre-load in Software for Caches Wouldn't Always Play Along

Memory latencies and bandwidth are major factors, limiting system performance and scalability. Modern CPUs aim at hiding latencies by employing large caches, out-of-order execution, or complex hardware prefetchers. However, software-based…

Databases · Computer Science 2025-06-23 Arthur Bernhardt, Sajjad Tamimi, Florian Stock, Andreas Koch, Ilia Petrov

Workflow-Driven Modeling for the Compute Continuum: An Optimization Approach to Automated System and Workload Scheduling

The convergence of IoT, Edge, Cloud, and HPC technologies creates a compute continuum that merges cloud scalability and flexibility with HPC's computational power and specialized optimizations. However, integrating cloud and HPC resources…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-20 Aasish Kumar Sharma, Christian Boehme, Patrick Gelß, Ramin Yahyapour, Julian Kunkel

Interprocess Communication in FreeBSD 11: Performance Analysis

Interprocess communication, IPC, is one of the most fundamental functions of a modern operating system, playing an essential role in the fabric of contemporary applications. This report conducts an investigation in FreeBSD of the real world…

Operating Systems · Computer Science 2020-08-06 A. H. Bell-Thomas

CBP: Coordinated management of cache partitioning, bandwidth partitioning and prefetch throttling

Reducing the average memory access time is crucial for improving the performance of applications running on multi-core architectures. With workload consolidation this becomes increasingly challenging due to shared resource contention.…

Hardware Architecture · Computer Science 2021-02-24 Nadja Ramhöj Holtryd, Madhavan Manivannan, Per Stenström, Miquel Pericàs

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

Transformers and LLMs have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is slow and often takes in the order…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-06-18 Avinash Maurya, Jie Ye, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

Intelligent colocation of HPC workloads

Many HPC applications suffer from a bottleneck in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be underutilized. It is hard for developers and runtime systems to ensure…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-17 Felippe V. Zacarias, Vinicius Petrucci, Rajiv Nishtala, Paul Carpenter, Daniel Mossé

Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing-in-Memory Hardware

Many modern workloads such as neural network inference and graph processing are fundamentally memory-bound. For such workloads, data movement between memory and CPU cores imposes a significant overhead in terms of both latency and energy. A…

Hardware Architecture · Computer Science 2023-04-04 Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, Onur Mutlu

Systems for Memory Disaggregation: Challenges & Opportunities

Memory disaggregation addresses memory imbalance in a cluster by decoupling CPU and memory allocations of applications while also increasing the effective memory capacity for (memory-intensive) applications beyond the local memory limit…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-02-07 Anil Yelam

In-Network Memory Access: Bridging SmartNIC and Host Memory

SmartNICs have been increasingly utilized across various applications to offload specific computational tasks, thereby enhancing overall system performance. However, this offloading process introduces several communication challenges that…

Networking and Internet Architecture · Computer Science 2025-07-08 Mohammed Zain Farooqi, Masoud Hemmatpour, Tore Heide Larsen

Compositional Memory Systems for Multimedia Communicating Tasks

Conventional cache models are not suited for real-time parallel processing because tasks may flush each other's data out of the cache in an unpredictable manner. In this way the system is not compositional so the overall performance is…

Hardware Architecture · Computer Science 2011-11-09 A. M. Molnos, M. J. M. Heijligers, S. D. Cotofana, J. T. J. Van Eijndhoven

Inter-Layer Per-Mobile Optimization of Cloud Mobile Computing: A Message-Passing Approach

Cloud mobile computing enables the offloading of computation-intensive applications from a mobile device to a cloud processor via a wireless interface. In light of the strong interplay between offloading decisions at the application layer…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-29 Shahrouz Khalili, Osvaldo Simeone