Related papers: An Interrupt-Driven Work-Sharing For-Loop Schedule…
Work-stealing is a widely used technique for balancing irregular parallel workloads, and most modern runtime systems adopt lock-free work-stealing deques to reduce contention and improve scalability. However, existing algorithms are…
Work Stealing has been a very successful algorithm for scheduling parallel computations, and is known to achieve high performances even for computations exhibiting fine-grained parallelism. We present a variant of \ws\ that provably avoids…
Work-stealing is a popular technique to implement dynamic load balancing in a distributed manner. In this approach, each process owns a set of tasks that have to be executed. The owner of the set can put tasks in it and can take tasks from…
Many shared-memory parallel irregular applications, such as sparse linear algebra and graph algorithms, depend on efficient loop scheduling (LS) in a fork-join manner despite that the work per loop iteration can greatly vary depending on…
Parallelism has become extremely popular over the past decade, and there have been a lot of new parallel algorithms and software. The randomized work-stealing (RWS) scheduler plays a crucial role in this ecosystem. In this paper, we study…
Algorithms for scheduling structured parallel computations have been widely studied in the literature. For some time now, Work Stealing is one of the most popular for scheduling such computations, and its performance has been studied in…
The intelligent Distributed Dispatch and Scheduling (iDDS) service is a versatile workflow orchestration system designed for large-scale, distributed scientific computing. iDDS extends traditional workload and data management by integrating…
To improve the utility of learning applications and render machine learning solutions feasible for complex applications, a substantial amount of heavy computations is needed. Thus, it is essential to delegate the computations among several…
The main goal of parallel processing is to provide users with performance that is much better than that of single processor systems. The execution of jobs is scheduled, which requires certain resources in order to meet certain criteria.…
Scheduling is an important task allowing parallel systems to perform efficiently and reliably. For modern computation systems, divisible load is a special type of data which can be divided into arbitrary sizes and independently processed in…
Work-stealing systems are typically oblivious to the nature of the tasks they are scheduling. For instance, they do not know or take into account how long a task will take to execute or how many subtasks it will spawn. Moreover, the actual…
The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e. "$\parallel$" (parallel) and "$;$" (serial), are insufficient in expressing "partial dependencies" or…
The augmented scale and complexity of urban transportation networks have significantly increased the execution time and resource requirements of vehicular network simulations, exceeding the capabilities of sequential simulators. The need…
The convergence of high-performance computing (HPC) and artificial intelligence (AI) is driving the emergence of increasingly complex parallel applications and workloads. These workloads often combine multiple parallel runtimes within the…
Parallel task-based programming models, like OpenMP, allow application developers to easily create a parallel version of their sequential codes. The standard OpenMP 4.0 introduced the possibility of describing a set of data dependences per…
We consider shared-object systems that require their threads to fulfill the system jobs by first acquiring sequentially the objects needed for the jobs and then holding on to them until the job completion. Such systems are in the core of a…
Task parallelism is designed to simplify the task of parallel programming. When executing a task parallel program on modern NUMA architectures, it can fail to scale due to the phenomenon called work inflation, where the overall processing…
Shared memory programming models usually provide worksharing and task constructs. The former relies on the efficient fork-join execution model to exploit structured parallelism; while the latter relies on fine-grained synchronization among…
The task-based dataflow programming model has emerged as an alternative to the process-centric programming model for extreme-scale applications. However, load balancing is still a challenge in task-based dataflow runtimes. In this paper, we…
Computationally-intensive loops are the primary source of parallelism in scientific applications. Such loops are often irregular and a balanced execution of their loop iterations is critical for achieving high performance. However, several…