Related papers: Scheduling data flow program in xkaapi: A new affi…
With growing deployment of Internet of Things (IoT) and machine learning (ML) applications, which need to leverage computation on edge and cloud resources, it is important to develop algorithms and tools to place these distributed…
The analysis of massive scientific data often happens in the form of workflows with interdependent tasks. When such a scientific workflow needs to be scheduled on a parallel or distributed system, one usually represents the workflow as a…
In order to improve system performance efficiently, a number of systems choose to equip multi-core and many-core processors (such as GPUs). Due to their discrete memory these heterogeneous architectures comprise a distributed system within…
Current approaches to scheduling workloads on heterogeneous systems with specialized accelerators often rely on manual partitioning, offloading tasks with specific compute patterns to accelerators. This method requires extensive…
Performance-, power-, and energy-aware scheduling techniques play an essential role in optimally utilizing processing elements (PEs) of heterogeneous systems. List schedulers, a class of low-complexity static schedulers, have commonly been…
We study the problem of executing an application represented by a precedence task graph on a parallel machine composed of standard computing cores and accelerators. Contrary to most existing approaches, we distinguish the allocation and the…
Heterogeneous architectures can deliver higher performance and energy efficiency than symmetric counterparts by using multiple architectures tuned to different types of workloads. While previous works focused on CPUs, this work extends the…
Non-uniform performance and power consumption across the processing elements (PEs) of heterogeneous SoCs increase the computation complexity of the task scheduling problem compared to homogeneous architectures. Latency of a software-based…
Accelerator-based heterogeneous architectures, such as CPU-GPU, CPU-TPU, and CPU-FPGA systems, are widely adopted to support the popular artificial intelligence (AI) algorithms that demand intensive computation. When deployed in real-time…
With the rapid evolution of GPU architectures, the heterogeneity of model training infrastructures is steadily increasing. In such environments, effectively utilizing all available heterogeneous accelerators becomes critical for distributed…
The escalating adoption of diffusion models for applications such as image generation demands efficient parallel inference techniques to manage their substantial computational cost. However, existing diffusion parallelism inference schemes…
This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a…
Diffusion models have achieved remarkable progress in high-fidelity image, video, and audio generation, yet inference remains computationally expensive. Nevertheless, current diffusion acceleration methods based on distributed parallelism…
In latency-sensitive applications, efficient task scheduling is crucial for maintaining Quality of Service (QoS) while meeting strict timing constraints. This paper addresses the challenge of scheduling periodic tasks structured as directed…
Heterogeneous computing can potentially offer significant performance and performance per watt improvements over homogeneous computing, but the question "what is the ideal mapping of algorithms to architectures?" remains an open one. In the…
To deliver high performance in power limited systems, architects have turned to using heterogeneous systems, either CPU+GPU or mixed CPU-hardware systems. However, in systems with different processor types and task affinities, scheduling…
The widely-adopted practice is to train deep learning models with specialized hardware accelerators, e.g., GPUs or TPUs, due to their superior performance on linear algebra operations. However, this strategy does not employ effectively the…
Deploying large language models (LLMs) for online inference is often constrained by limited GPU memory, particularly due to the growing KV cache during auto-regressive decoding. Hybrid GPU-CPU execution has emerged as a promising solution…
The scheduling of task graphs with communication delays has been extensively studied. Recently, new results for the common sub-case of fork-join shaped task graphs were published, including an EPTAS and polynomial algorithms for special…
Modern platforms are using accelerators in conjunction with standard processing units in order to reduce the running time of specific operations, such as matrix operations, and improve their performance. Scheduling on such hybrid platforms…