Related papers: Automatic Parallelization of Python Programs for D…
Recent improvements in programming languages, programming models, and frameworks have focused on abstracting users away from many programming issues. Among others, recent programming frameworks include simpler syntax, automatic memory…
As the artificial intelligence community advances into the era of large models with billions of parameters, distributed training and inference have become essential. While various parallelism strategies, such as data, model, sequence, and…
This article presents an automatic approach to quickly derive a good solution for hardware resource partition and task granularity for task-based parallel applications on heterogeneous many-core architectures. Our approach employs a…
Sparse, irregular graphs show up in various applications like linear algebra, machine learning, engineering simulations, robotic control, etc. These graphs have a high degree of parallelism, but their execution on parallel threads of modern…
This paper describes Plumbing for Optimization with Asynchronous Parallelism (POAP) and the Python Surrogate Optimization Toolbox (pySOT). POAP is an event-driven framework for building and combining asynchronous optimization strategies,…
Prior work on Automatically Scalable Computation (ASC) suggests that it is possible to parallelize sequential computation by building a model of whole-program execution, using that model to predict future computations, and then…
PARyOpt is a Python-based implementation of the Bayesian optimization routine designed for remote and asynchronous function evaluations. Bayesian optimization is especially attractive for computational optimization due to its low cost…
The aim of this paper is to introduce general techniques to optimize the parallel execution time of sorting on distributed architectures with processors of varying speeds. Such an application requires a partitioning step. For…
The rapid adoption of large language models and multimodal foundation models has made multimodal data preparation pipelines critical AI infrastructure. These pipelines interleave CPU-heavy preprocessing with accelerator-backed (GPU/NPU/TPU)…
With the growing deployment of Internet of Things (IoT) and machine learning (ML) applications, which need to leverage computation on edge and cloud resources, it is important to develop algorithms and tools to place these distributed…
There are billions of lines of sequential code in today's software that do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote an efficient use of the…
The rapid growth of large language models (LLMs) and the continuous release of new GPU products have significantly increased the demand for distributed training across heterogeneous GPU environments. In this paper, we present a…
Randomized parallel algorithms for many fundamental problems achieve optimal linear work in expectation, but upgrading this guarantee to hold with high probability (whp) remains a recurring theoretical challenge. In this paper, we address…
Heterogeneous computing is becoming mainstream in all domains. This new era in computer architecture brings a new paradigm called Accelerator Level Parallelism (ALP). In ALP, accelerators are used concurrently to provide unprecedented levels…
Detecting parallelizable code regions is a challenging task, even for experienced developers. Numerous recent studies have explored the use of machine learning for code analysis and program synthesis, including parallelization, in light of…
We present EvoSort, a general-purpose adaptive parallel sorting framework accessible at the Python level. EvoSort employs a Genetic Algorithm (GA) to automatically discover and refine critical parameters, including insertion sort…
In this paper, we present several improvements in the parallelization of the in-place merge algorithm, which merges two contiguous sorted arrays into one with an O(T) space complexity (where T is the number of threads). The approach divides…
We present a novel characterization of the mapping of multiple parallelism forms (e.g. data and model parallelism) onto hierarchical accelerator systems that is hierarchy-aware and greatly reduces the space of software-to-hardware mapping.…
Manual parallelization of code remains a significant challenge due to the complexities of modern software systems and the widespread adoption of multi-core architectures. This paper introduces OMPar, an AI-driven tool designed to automate…
Purpose: Optoacoustic tomography (OAT) is inherently a three-dimensional (3D) inverse problem. However, most studies of OAT image reconstruction still employ two-dimensional (2D) imaging models. One important reason is that 3D image…