Related papers: Specializing Coherence, Consistency, and Push/Pull…

To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations

We reduce the cost of communication and synchronization in graph processing by analyzing the fastest way to process graphs: pushing the updates to a shared state or pulling the updates to a private state. We investigate the applicability of…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-02 Maciej Besta, Michal Podstawski, Linus Groner, Edgar Solomonik, Torsten Hoefler

Characterizing the Performance of Emerging Deep Learning, Graph, and High Performance Computing Workloads Under Interference

Throughput-oriented computing via co-running multiple applications in the same machine has been widely adopted to achieve high hardware utilization and energy saving on modern supercomputers and data centers. However, efficiently co-running…

Performance · Computer Science 2023-03-29 Hao Xu, Shuang Song, Ze Mao

Dynamic Load Balancing Strategies for Graph Applications on GPUs

Acceleration of graph applications on GPUs has found large interest due to the ubiquitous use of graph processing in various domains. The inherent irregularity in graph applications leads to several challenges for parallelization.…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-11-02 Ananya Raval, Rupesh Nasre, Vivek Kumar, Vasudevan R, Sathish Vadhiyar, Keshav Pingali

Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU

Dynamic graph neural network (DGNN) is becoming increasingly popular because of its widespread use in capturing dynamic features in the real world. A variety of dynamic graph neural networks designed from algorithmic perspectives have…

Hardware Architecture · Computer Science 2023-04-17 Hanqiu Chen, Yahya Alhinai, Yihan Jiang, Eunjee Na, Cong Hao

Scaling Up Large-Scale Graph Processing for GPU-Accelerated Heterogeneous Systems

Not only with the large host memory for supporting large scale graph processing, GPU-accelerated heterogeneous architecture can also provide a great potential for high-performance computing. However, few existing heterogeneous systems can…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-05 Xianliang Li

An Adaptive Load Balancer For Graph Analytical Applications on GPUs

Load-balancing among the threads of a GPU for graph analytics workloads is difficult because of the irregular nature of graph applications and the high variability in vertex degrees, particularly in power-law graphs. We describe a novel…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-28 Vishwesh Jatala, Loc Hoang, Roshan Dathathri, Gurbinder Gill, V Krishna Nandivada, Keshav Pingali

Exploring the Design Space of Static and Incremental Graph Connectivity Algorithms on GPUs

Connected components and spanning forest are fundamental graph algorithms due to their use in many important applications, such as graph clustering and image segmentation. GPUs are an ideal platform for graph algorithms due to their high…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-08-28 Changwan Hong, Laxman Dhulipala, Julian Shun

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs

Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-11 Mhd Ghaith Olabi, Juan Gómez Luna, Onur Mutlu, Wen-mei Hwu, Izzat El Hajj

Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU

GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Hancheng Wu, Da Li, Michela Becchi

A Graph-Partition-Based Scheduling Policy for Heterogeneous Architectures

In order to improve system performance efficiently, a number of systems choose to equip multi-core and many-core processors (such as GPUs). Due to their discrete memory these heterogeneous architectures comprise a distributed system within…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-02-27 Hao Wu, Daniel Lohmann, Wolfgang Schröder-Preikschat

A Benchmark on Directed Graph Representation Learning in Hardware Designs

To keep pace with the rapid advancements in design complexity within modern computing systems, directed graph representation learning (DGRL) has become crucial, particularly for encoding circuit netlists, computational graphs, and…

Machine Learning · Computer Science 2024-10-10 Haoyu Wang, Yinan Huang, Nan Wu, Pan Li

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-10-02 Gabin Schieffer, Ruimin Shi, Stefano Markidis, Andreas Herten, Jennifer Faj, Ivy Peng

Exploring Memory Persistency Models for GPUs

Given its high integration density, high speed, byte addressability, and low standby power, non-volatile or persistent memory is expected to supplement/replace DRAM as main memory. Through persistency programming models (which define…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-30 Zhen Lin, Mohammad Alshboul, Yan Solihin, Huiyang Zhou

GraphFLEx: Structure Learning Framework for Large Expanding Graphs

Graph structure learning is a core problem in graph-based machine learning, essential for uncovering latent relationships and ensuring model interpretability. However, most existing approaches are ill-suited for large-scale and dynamically…

Machine Learning · Computer Science 2025-05-20 Mohit Kataria, Nikita Malik, Sandeep Kumar, Jayadeva

Data-regularized Reinforcement Learning for Diffusion Models at Scale

Aligning generative diffusion models with human preferences via reinforcement learning (RL) is critical yet challenging. Most existing algorithms are often vulnerable to reward hacking, such as quality degradation, over-stylization, or…

Machine Learning · Computer Science 2025-12-25 Haotian Ye, Kaiwen Zheng, Jiashu Xu, Puheng Li, Huayu Chen, Jiaqi Han, Sheng Liu, Qinsheng Zhang, Hanzi Mao, Zekun Hao, Prithvijit Chattopadhyay, Dinghao Yang, Liang Feng, Maosheng Liao, Junjie Bai, Ming-Yu Liu, James Zou, Stefano Ermon

DeFoG: Discrete Flow Matching for Graph Generation

Graph generative models are essential across diverse scientific domains by capturing complex distributions over relational data. Among them, graph diffusion models achieve superior performance but face inefficient sampling and limited…

Machine Learning · Computer Science 2025-06-17 Yiming Qin, Manuel Madeira, Dorina Thanou, Pascal Frossard

Puzzle: Scheduling Multiple Deep Learning Models on Mobile Device with Heterogeneous Processors

As deep learning models are increasingly deployed on mobile devices, modern mobile devices incorporate deep learning-specific accelerators to handle the growing computational demands, thus increasing their hardware heterogeneity. However,…

Machine Learning · Computer Science 2025-08-26 Duseok Kang, Yunseong Lee, Junghoon Kim

Exploring Thread Coarsening on FPGA

Over the past few years, there has been an increased interest in including FPGAs in data centers and high-performance computing clusters along with GPUs and other accelerators. As a result, it has become increasingly important to have a…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-09-14 Mostafa Eghbali Zarch, Reece Neff, Michela Becchi

Technical Report: Benefits of Stabilization versus Rollback in Self-Stabilizing Graph-Based Applications on Eventually Consistent Key-Value Stores

In this paper, we evaluate and compare the performance of two approaches, namely self-stabilization and rollback, to handling consistency violating faults (cvf) that occur when a self-stabilizing distributed graph-based program is executed…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-07-29 Duong Nguyen, Sandeep S. Kulkarni

Challenges in Migrating Imperative Deep Learning Programs to Graph Execution: An Empirical Study

Efficiency is essential to support responsiveness w.r.t. ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code that supports symbolic, graph-based Deep…

Software Engineering · Computer Science 2022-07-20 Tatiana Castro Vélez, Raffi Khatchadourian, Mehdi Bagherzadeh, Anita Raja