Related papers: Same Engine, Multiple Gears: Parallelizing Fixpoin…
In this paper we analyze, evaluate, and improve the performance of training generalized linear models on modern CPUs. We start with a state-of-the-art asynchronous parallel training algorithm, identify system-level performance bottlenecks,…
Parallel parameterized complexity theory studies how fixed-parameter tractable (fpt) problems can be solved in parallel. Previous theoretical work focused on parallel algorithms that are very fast in principle, but did not take into account…
Harnessing parallelism in seemingly sequential models is a central challenge for modern machine learning. Several approaches have been proposed for evaluating sequential processes in parallel using iterative fixed-point methods, like…
This paper initiates the studies of parallel algorithms for core maintenance in dynamic graphs. The core number is a fundamental index reflecting the cohesiveness of a graph, which are widely used in large-scale graph analytics. The core…
Supercomputers are equipped with an increasingly large number of cores to use computational power as a way of solving problems that are otherwise intractable. Unfortunately, getting serial algorithms to run in parallel to take advantage of…
Sequential models, such as Recurrent Neural Networks and Neural Ordinary Differential Equations, have long suffered from slow training due to their inherent sequential nature. For many years this bottleneck has persisted, as many thought…
Many tasks in data mining and related fields can be formalized as matching between objects in two heterogeneous domains, including collaborative filtering, link prediction, image tagging, and web search. Machine learning techniques,…
Previous work has shown that there are two major complexity barriers in the synthesis of fault-tolerant distributed programs: (1) generation of fault-span, the set of states reachable in the presence of faults, and (2) resolving deadlock…
Transformer models have achieved state-of-the-art performance on various domains of applications and gradually becomes the foundations of the advanced large deep learning (DL) models. However, how to train these models over multiple GPUs…
In recent years, various means of efficiently detecting changepoints in the univariate setting have been proposed, with one popular approach involving minimising a penalised cost function using dynamic programming. In some situations, these…
An efficient and flexible engine for computing fixed points is critical for many practical applications. In this paper, we firstly present a goal-directed fixed point computation strategy in the logic programming paradigm. The strategy…
Two characteristics that make convex decomposition algorithms attractive are simplicity of operations and generation of parallelizable structures. In principle, these schemes require that all coordinates update at the same time, i.e., they…
Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted…
The problem of Bayesian filtering and smoothing in nonlinear models with additive noise is an active area of research. Classical Taylor series as well as more recent sigma-point based methods are two well-known strategies to deal with these…
We study parallel algorithms for the minimisation and equivalence checking of Deterministic Finite Automata (DFAs). Regarding DFA minimisation, we implement four different massively parallel algorithms on Graphics Processing Units~(GPUs).…
To put static program analysis at the fingertips of the software developer, we propose a framework for interactive abstract interpretation. While providing sound analysis results, abstract interpretation in general can be quite costly. To…
We provide a multilevel approach for analysing performances of parallel algorithms. The main outcome of such approach is that the algorithm is described by using a set of operators which are related to each other according to the problem…
Computation of a signal's estimated covariance matrix is an important building block in signal processing, e.g., for spectral estimation. Each matrix element is a sum of products of elements in the input matrix taken over a sliding window.…
There are billions of lines of sequential code inside nowadays' software which do not benefit from the parallelism available in modern multicore architectures. Automatically parallelizing sequential code, to promote an efficient use of the…
Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…