Related papers: ACS: Concurrent Kernel Execution on Irregular, Inp…

Exploiting Dependency and Parallelism: Real-Time Scheduling and Analysis for GPU Tasks

With the rapid advancement of Artificial Intelligence, the Graphics Processing Unit (GPU) has become increasingly essential across a growing number of safety-critical application domains. Applying a GPU is indispensable for parallel…

Operating Systems · Computer Science 2026-02-25 Yuanhai Zhang , Songyang He , Ruizhe Gou , Mingyue Cui , Boyang Li , Shuai Zhao , Kai Huang

ACGraph: An Efficient Asynchronous Out-of-Core Graph Processing Framework

Graphs are a ubiquitous data structure in diverse domains such as machine learning, social networks, and data mining. As real-world graphs continue to grow beyond the memory capacity of single machines, out-of-core graph processing systems…

Databases · Computer Science 2025-11-12 Dechuang Chen , Sibo Wang , Qintian Guo

Concurrent Scheduling of High-Level Parallel Programs on Multi-GPU Systems

Parallel programming models can encourage performance portability by moving the responsibility for work assignment and data distribution from the programmer to a runtime system. However, analyzing the resulting implicit memory allocations,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-03-14 Fabian Knorr , Philip Salzmann , Peter Thoman , Thomas Fahringer

Fast Processing of Large Graph Applications Using Asynchronous Architecture

Graph algorithms and techniques are increasingly being used in scientific and commercial applications to express relations and explore large data sets. Although conventional or commodity computer architectures, like CPU or GPU, can compute…

Hardware Architecture · Computer Science 2017-07-03 Michel A. Kinsy , Rashmi S. Agrawal , Hien D. Nguyen

APEX: Asynchronous Parallel CPU-GPU Execution for Online LLM Inference on Constrained GPUs

Deploying large language models (LLMs) for online inference is often constrained by limited GPU memory, particularly due to the growing KV cache during auto-regressive decoding. Hybrid GPU-CPU execution has emerged as a promising solution…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-16 Jiakun Fan , Yanglin Zhang , Xiangchen Li , Dimitrios S. Nikolopoulos

Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution

Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. On the other hand, their unified memory architecture and narrower gap between CPU and GPU performance…

Machine Learning · Computer Science 2026-02-20 Zhuojin Li , Marco Paolieri , Leana Golubchik

ARCAS: Adaptive Runtime System for Chiplet-Aware Scheduling

The growing disparity between CPU core counts and available memory bandwidth has intensified memory contention in servers. This particularly affects highly parallelizable applications, which must achieve efficient cache utilization to…

Hardware Architecture · Computer Science 2025-03-17 Alessandro Fogli , Bo Zhao , Peter Pietzuch , Jana Giceva

Automatic Parallelization of Sequential Programs

Prior work on Automatically Scalable Computation (ASC) suggests that it is possible to parallelize sequential computation by building a model of whole-program execution, using that model to predict future computations, and then…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-09-21 Peter Kraft , Amos Waterland , Daniel Y Fu , Anitha Gollamudi , Shai Szulanski , Margo Seltzer

AWB-GCN: A Graph Convolutional Network Accelerator with Runtime Workload Rebalancing

Deep learning systems have been successfully applied to Euclidean data such as images, video, and audio. In many applications, however, information and their relationships are better expressed with graphs. Graph Convolutional Networks…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-09-14 Tong Geng , Ang Li , Runbin Shi , Chunshu Wu , Tianqi Wang , Yanfei Li , Pouya Haghi , Antonino Tumeo , Shuai Che , Steve Reinhardt , Martin Herbordt

Scheduling Computation Graphs of Deep Learning Models on Manycore CPUs

For a deep learning model, efficient execution of its computation graph is key to achieving high performance. Previous work has focused on improving the performance for individual nodes of the computation graph, while ignoring the…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-26 Linpeng Tang , Yida Wang , Theodore L. Willke , Kai Li

Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

Graphics processors, or GPUs, have recently been widely used as accelerators in the shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-03-22 Jianlong Zhong , Bingsheng He

An Empirical-cum-Statistical Approach to Power-Performance Characterization of Concurrent GPU Kernels

Growing deployment of power and energy efficient throughput accelerators (GPU) in data centers demands enhancement of power-performance co-optimization capabilities of GPUs. Realization of exascale computing using accelerators requires…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-06 Nilanjan Goswami , Amer Qouneh , Chao Li , Tao Li

RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks with Fine-Grain Utilization

Many emerging cyber-physical systems, such as autonomous vehicles and robots, rely heavily on artificial intelligence and machine learning algorithms to perform important system operations. Since these highly parallel applications are…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-02-07 An Zou , Jing Li , Christopher D. Gill , Xuan Zhang

Static Graph Challenge: Subgraph Isomorphism

The rise of graph analytic systems has created a need for ways to measure and compare the capabilities of these systems. Graph analytics present unique scalability difficulties. The machine learning, high performance computing, and visual…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-07 Siddharth Samsi , Vijay Gadepally , Michael Hurley , Michael Jones , Edward Kao , Sanjeev Mohindra , Paul Monticciolo , Albert Reuther , Steven Smith , William Song , Diane Staheli , Jeremy Kepner

Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU

GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Hancheng Wu , Da Li , Michela Becchi

The Dynamical Kernel Scheduler - Part 1

Emerging processor architectures such as GPUs and Intel MICs provide a huge performance potential for high performance computing. However developing software using these hardware accelerators introduces additional challenges for the…

Computational Physics · Physics 2016-09-21 Andreas Adelmann , Uldis Locans , Andreas Suter

Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable

There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even…

Data Structures and Algorithms · Computer Science 2019-08-22 Laxman Dhulipala , Guy E. Blelloch , Julian Shun

Accel-GCN: High-Performance GPU Accelerator Design for Graph Convolution Networks

Graph Convolutional Networks (GCNs) are pivotal in extracting latent information from graph data across various domains, yet their acceleration on mainstream GPUs is challenged by workload imbalance and memory access irregularity. To…

Hardware Architecture · Computer Science 2023-08-24 Xi Xie , Hongwu Peng , Amit Hasan , Shaoyi Huang , Jiahui Zhao , Haowen Fang , Wei Zhang , Tong Geng , Omer Khan , Caiwen Ding

Boosting Performance of Iterative Applications on GPUs: Kernel Batching with CUDA Graphs

Graphics Processing Units (GPUs) have become the standard in accelerating scientific applications on heterogeneous systems. However, as GPUs are getting faster, one potential performance bottleneck with GPU-accelerated applications is the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-01 Jonah Ekelund , Stefano Markidis , Ivy Peng

A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling

The ability to model, analyze, and predict execution time of computations is an important building block supporting numerous efforts, such as load balancing, performance optimization, and automated performance tuning for high performance,…

Performance · Computer Science 2020-06-22 James D. Stevens , Andreas Klöckner