Related papers: DuctTeip: An efficient programming model for distr…
As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies-data, model, sequence, and…
In this paper, we derive and investigate approaches to dynamically load balance a distributed task parallel application software. The load balancing strategy is based on task migration. Busy processes export parts of their ready task queue…
The parallel and distributed processing are becoming de facto industry standard, and a large part of the current research is targeted on how to make computing scalable and distributed, dynamically, without allocating the resources on…
Real-time scheduling and locking protocols are fundamental facilities to construct time-critical systems. For parallel real-time tasks, predictable locking protocols are required when concurrent sub-jobs mutually exclusive access to shared…
Arrival of multicore systems has enforced a new scenario in computing, the parallel and distributed algorithms are fast replacing the older sequential algorithms, with many challenges of these techniques. The distributed algorithms provide…
Shared memory programming models usually provide worksharing and task constructs. The former relies on the efficient fork-join execution model to exploit structured parallelism; while the latter relies on fine-grained synchronization among…
One typical use case of large-scale distributed computing in data centers is to decompose a computation job into many independent tasks and run them in parallel on different machines, sometimes known as the "embarrassingly parallel"…
Modern large-scale scientific applications consist of thousands to millions of individual tasks. These tasks involve not only computation but also communication with one another. Typically, the communication pattern between tasks is sparse…
This article describes a geometric partitioning software that can be used for quick computation of data partitions on many-core HPC machines. It is most suited for dynamic applications with load distributions that vary with time.…
We introduce a new model for the task mapping problem to aid in the systematic design of algorithms for heterogeneous systems including, but not limited to, CPUs, GPUs and FPGAs. A special focus is set on the communication between the…
The ability to express a program as a hierarchical composition of parts is an essential tool in managing the complexity of software and a key abstraction this provides is to separate the representation of data from the computation. Many…
We provide algorithms for efficiently addressing quantum memory in parallel. These imply that the standard circuit model can be simulated with low overhead by the more realistic model of a distributed quantum computer. As a result, the…
Task-based programming models have proven to be a robust and versatile way to approach development of applications for distributed environments. They provide natural programming patterns with high performance. However, execution on this…
This paper investigates co-scheduling algorithms for processing a set of parallel applications. Instead of executing each application one by one, using a maximum degree of parallelism for each of them, we aim at scheduling several…
Scheduling query execution plans is a particularly complex problem in shared-nothing parallel systems, where each site consists of a collection of local time-shared (e.g., CPU(s) or disk(s)) and space-shared (e.g., memory) resources and…
We propose Chunks and Tasks, a parallel programming model built on abstractions for both data and work. The application programmer specifies how data and work can be split into smaller pieces, chunks and tasks, respectively. The Chunks and…
The task-based dataflow programming model has emerged as an alternative to the process-centric programming model for extreme-scale applications. However, load balancing is still a challenge in task-based dataflow runtimes. In this paper, we…
The aim of the paper is to introduce general techniques in order to optimize the parallel execution time of sorting on a distributed architectures with processors of various speeds. Such an application requires a partitioning step. For…
The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields.…
Large-scale deep learning models contribute to significant performance improvements on varieties of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the…