Related papers: Warp-Level Parallelism: Enabling Multiple Replicat…

Embarrassingly Parallel Independent Training of Multi-Layer Perceptrons with Heterogeneous Architectures

The definition of a Neural Network architecture is one of the most critical and challenging tasks to perform. In this paper, we propose ParallelMLPs. ParallelMLPs is a procedure to enable the training of several independent Multilayer…

Machine Learning · Computer Science 2022-06-20 Felipe Costa Farias, Teresa Bernarda Ludermir, Carmelo Jose Albanez Bastos-Filho

WaSP: Warp Scheduling to Mimic Prefetching in Graphics Workloads

Contemporary GPUs are designed to handle long-latency operations effectively; however, challenges such as core occupancy (number of warps in a core) and pipeline width can impede their latency management. This is particularly evident in…

Hardware Architecture · Computer Science 2024-04-10 Diya Joseph, Juan Luis Aragón, Joan-Manuel Parcerisa, Antonio Gonzalez

Parallelizing a modern GPU simulator

Simulators are a primary tool in computer architecture research but are extremely computationally intensive. Simulating modern architectures with increased core counts and recent workloads can be challenging, even on modern hardware. This…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-27 Rodrigo Huerta, Antonio González

Parallel-in-iteration optimization using multigrid reduction-in-time

Standard gradient-based iteration algorithms for optimization, such as gradient descent and its various proximal-based extensions to nonsmooth problems, are known to converge slowly for ill-conditioned problems, sometimes requiring many…

Numerical Analysis · Mathematics 2026-03-24 G. H. M. Araújo, O. A. Krzysik, H. De Sterck

Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUs

Domain-specific languages that execute image processing pipelines on GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have…

Programming Languages · Computer Science 2020-09-09 Abhinav Jangda, Arjun Guha

DistMLIP: A Distributed Inference Platform for Machine Learning Interatomic Potentials

Large-scale atomistic simulations are essential to bridge computational materials and chemistry to realistic materials and drug discovery applications. In the past few years, rapid developments of machine learning interatomic potentials…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-03 Kevin Han, Bowen Deng, Amir Barati Farimani, Gerbrand Ceder

Acceleration for Timing-Aware Gate-Level Logic Simulation with One-Pass GPU Parallelism

Witnessing the advancing scale and complexity of chip design and benefiting from high-performance computation technologies, the simulation of Very Large Scale Integration (VLSI) Circuits imposes an increasing requirement for acceleration…

Data Structures and Algorithms · Computer Science 2023-04-27 Weijie Fang, Yanggeng Fu, Jiaquan Gao, Longkun Guo, Gregory Gutin, Xiaoyan Zhang

Cimple: Instruction and Memory Level Parallelism

Modern out-of-order processors have increased capacity to exploit instruction level parallelism (ILP) and memory level parallelism (MLP), e.g., by using wide superscalar pipelines and vector execution units, as well as deep buffers for…

Programming Languages · Computer Science 2018-07-05 Vladimir Kiriansky, Haoran Xu, Martin Rinard, Saman Amarasinghe

SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training

Pipeline Parallelism (PP) serves as a crucial technique for training Large Language Models (LLMs), owing to its capability to alleviate memory pressure from model states with relatively low communication overhead. However, in long-context…

Machine Learning · Computer Science 2025-04-22 Zhouyang Li, Yuliang Liu, Wei Zhang, Tailing Yuan, Bin Chen, Chengru Song, Di Zhang

XPipe: Efficient Pipeline Model Parallelism for Multi-GPU DNN Training

We propose XPipe, an efficient asynchronous pipeline model parallelism approach for multi-GPU DNN training. XPipe is designed to use multiple GPUs to concurrently and continuously train different parts of a DNN model. To improve GPU…

Machine Learning · Computer Science 2020-11-10 Lei Guan, Wotao Yin, Dongsheng Li, Xicheng Lu

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari, Mudabir Qamar Ansari

A Multi-signal Variant for the GPU-based Parallelization of Growing Self-Organizing Networks

Among the many possible approaches for the parallelization of self-organizing networks, and in particular of growing self-organizing networks, perhaps the most common one is producing an optimized, parallel implementation of the standard…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-03-31 Giacomo Parigi, Angelo Stramieri, Danilo Pau, Marco Piastra

Rethinking State-Machine Replication for Parallelism

State-machine replication, a fundamental approach to designing fault-tolerant services, requires commands to be executed in the same order by all replicas. Moreover, command execution must be deterministic: each replica must produce the…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-11-26 Parisa Jalili Marandi, Carlos Eduardo Bezerra, Fernando Pedone

SiDP: Memory-Efficient Data Parallelism for Offline LLM Inference

The rapid adoption of large language models (LLMs) has shifted a substantial portion of inference workloads into throughput-oriented offline regimes, where fully utilizing GPU compute requires large batch sizes. However, existing…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-28 Alan Zhao, Cyril Y. He

Multigrid Reduction in Time for Chaotic Dynamical Systems

As CPU clock speeds have stagnated and high performance computers continue to have ever higher core counts, increased parallelism is needed to take advantage of these new architectures. Traditional serial time-marching schemes can be a…

Numerical Analysis · Mathematics 2022-08-29 David A. Vargas, Robert D. Falgout, Stefanie Günther, Jacob B. Schroder

Parallel Constraint-Driven Inductive Logic Programming

Multi-core machines are ubiquitous. However, most inductive logic programming (ILP) approaches use only a single core, which severely limits their scalability. To address this limitation, we introduce parallel techniques based on…

Artificial Intelligence · Computer Science 2021-09-16 Andrew Cropper, Oghenejokpeme Orhobor, Cristian Dinu, Rolf Morel

Parallelizing Over Artificial Neural Network Training Runs with Multigrid

Artificial neural networks are a popular and effective machine learning technique. Great progress has been made parallelizing the expensive training phase of an individual network, leading to highly specialized pieces of hardware, many…

Numerical Analysis · Computer Science 2017-10-03 Jacob B. Schroder

targetDP: an Abstraction of Lattice Based Parallelism with Portable Performance

To achieve high performance on modern computers, it is vital to map algorithmic parallelism to that inherent in the hardware. From an application developer's perspective, it is also important that code can be maintained in a portable manner…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-20 Alan Gray, Kevin Stratford

Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-14 Fabian Knorr, Philip Salzmann, Peter Thoman, Thomas Fahringer

Enabling Software Resilience in GPGPU Applications via Partial Thread Protection

Graphics Processing Units (GPUs) are widely used by various applications in a broad variety of fields to accelerate their computation but remain susceptible to transient hardware faults (soft errors) that can easily compromise application…

Software Engineering · Computer Science 2021-03-30 Lishan Yang, Bin Nie, Adwait Jog, Evgenia Smirni