Related papers: Streaming Message Interface: High-Performance Dist…

FMI: Fast and Cheap Message Passing for Serverless Functions

Serverless functions provide elastic scaling and a fine-grained billing model, making Function-as-a-Service (FaaS) an attractive programming model. However, for distributed jobs that benefit from large-scale and dynamic parallelism, the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-05-16 Marcin Copik, Roman Böhringer, Alexandru Calotoiu, Torsten Hoefler

Exploring Fully Offloaded GPU Stream-Aware Message Passing

Modern heterogeneous supercomputing systems are comprised of CPUs, GPUs, and high-speed network interconnects. Communication libraries supporting efficient data transfers involving memory buffers from the GPU memory typically require the…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-29 Naveen Namashivayam, Krishna Kandalla, James B White, Larry Kaplan, Mark Pagel

SME: A High Productivity FPGA Tool for Software Programmers

For several decades, the CPU has been the standard model to use in the majority of computing. While the CPU does excel in some areas, heterogeneous computing, such as reconfigurable hardware, is showing increasing potential in areas like…

Hardware Architecture · Computer Science 2021-04-21 Carl-Johannes Johnsen, Alberte Thegler, Kenneth Skovhede, Brian Vinter

sPIN: High-performance streaming Processing in the Network

Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-20 Torsten Hoefler, Salvatore Di Girolamo, Konstantin Taranov, Ryan E. Grant, Ron Brightwell

Performance Evaluation of Parallel Message Passing and Thread Programming Model on Multicore Architectures

The current trend of multicore architectures on shared memory systems underscores the need for parallelism. While there are several programming models to express parallelism, the thread programming model has become a standard to support these system…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-12-13 D. T. Hasta, A. B. Mutiara

MIMS: Towards a Message Interface based Memory System

Memory system is often the main bottleneck in chip-multiprocessor (CMP) systems in terms of latency, bandwidth and efficiency, and recently additionally facing capacity and power problems in an era of big data. A lot of research works have…

Hardware Architecture · Computer Science 2014-04-10 Licheng Chen, Tianyue Lu, Yanan Wang, Mingyu Chen, Yuan Ruan, Zehan Cui, Yongbing Huang, Mingyang Chen, Jiutian Zhang, Yungang Bao

The PetscSF Scalable Communication Layer

PetscSF, the communication component of the Portable, Extensible Toolkit for Scientific Computation (PETSc), is designed to provide PETSc's communication infrastructure suitable for exascale computers that utilize GPUs and other…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-05-25 Junchao Zhang, Jed Brown, Satish Balay, Jacob Faibussowitsch, Matthew Knepley, Oana Marin, Richard Tran Mills, Todd Munson, Barry F. Smith, Stefano Zampini

MLI: An API for Distributed Machine Learning

MLI is an Application Programming Interface designed to address the challenges of building Machine Learning algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of…

Machine Learning · Computer Science 2013-10-29 Evan R. Sparks, Ameet Talwalkar, Virginia Smith, Jey Kottalam, Xinghao Pan, Joseph Gonzalez, Michael J. Franklin, Michael I. Jordan, Tim Kraska

PGMPI: Automatically Verifying Self-Consistent MPI Performance Guidelines

The Message Passing Interface (MPI) is the most commonly used application programming interface for process communication on current large-scale parallel systems. Due to the scale and complexity of modern parallel architectures, it is…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-09-05 Sascha Hunold, Alexandra Carpen-Amarie, Felix Donatus Lübbe, Jesper Larsson Träff

Software-Distributed Shared Memory for Heterogeneous Machines: Design and Use Considerations

Distributed shared memory (DSM) allows to implement and deploy applications onto distributed architectures using the convenient shared memory programming model in which a set of tasks are able to allocate and access data despite their…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-04 Loïc Cudennec

MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface

Application development for distributed computing "Grids" can benefit from tools that variously hide or enable application-level management of critical aspects of the heterogeneous environment. As part of an investigation of these issues,…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 N. T. Karonis, B. Toonen, I. Foster

Modeling and Simulation of Spark Streaming

As more and more devices connect to Internet of Things, unbounded streams of data will be generated, which have to be processed "on the fly" in order to trigger automated actions and deliver real-time services. Spark Streaming is a popular…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-12 Jia-Chun Lin, Ming-Chang Lee, Ingrid Chieh Yu, Einar Broch Johnsen

The Distributed Network Processor: a novel off-chip and on-chip interconnection network architecture

One of the most demanding challenges for the designers of parallel computing architectures is to deliver an efficient network infrastructure providing low latency, high bandwidth communications while preserving scalability. Besides off-chip…

Hardware Architecture · Computer Science 2012-03-08 Andrea Biagioni, Francesca Lo Cicero, Alessandro Lonardo, Pier Stanislao Paolucci, Mersia Perra, Davide Rossetti, Carlo Sidore, Francesco Simula, Laura Tosoratto, Piero Vicini

A Software Parallel Programming Approach to FPGA-Accelerated Computing

This paper introduces an effort to incorporate reconfigurable logic (FPGA) components into a software programming model. For this purpose, we have implemented a hardware engine for remote memory communication between hardware computation…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-08-22 Ruediger Willenberg, Paul Chow

Parallel Paradigms in Modern HPC: A Comparative Analysis of MPI, OpenMP, and CUDA

This paper presents a comprehensive comparison of three dominant parallel programming models in High Performance Computing (HPC): Message Passing Interface (MPI), Open Multi-Processing (OpenMP), and Compute Unified Device Architecture…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-06-19 Nizar ALHafez, Ahmad Kurdi

Shared Memory-Aware Latency-Sensitive Message Aggregation for Fine-Grained Communication

Message aggregation is often used with a goal to reduce communication cost in HPC applications. The difference in the order of overhead of sending a message and cost of per byte transferred motivates the need for message aggregation, for…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-11-07 Kavitha Chandrasekar, Laxmikant Kale

Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects

Conventional wisdom holds that an efficient interface between an OS running on a CPU and a high-bandwidth I/O device should use Direct Memory Access (DMA) to offload data transfer, descriptor rings for buffering and queuing, and interrupts…

Hardware Architecture · Computer Science 2025-04-25 Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe

Learning from the Success of MPI

The Message Passing Interface (MPI) has been extremely successful as a portable way to program high-performance parallel computers. This success has occurred in spite of the view of many that message passing is difficult and that other…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 William D. Gropp

User-level DSM System for Modern High-Performance Interconnection Networks

In this paper, we introduce a new user-level DSM system which has the ability to directly interact with underlying interconnection networks. The DSM system provides the application programmer a flexible API to program parallel applications…

Distributed, Parallel, and Cluster Computing · Computer Science 2007-05-23 Bharath Ramesh, Srinidhi Varadarajan

NetDAM: Network Direct Attached Memory with Programmable In-Memory Computing ISA

Data-intensive applications like distributed AI-training may require multi-terabyte memory capacity with multi-terabit bandwidth. We directly attach the memory to the Ethernet controller with some programmable logic to design an efficient…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-10-29 Kevin Fang, David Peng