Related papers: A Fast and Generic GPU-Based Parallel Reduction Im…
We present a new adaptive parallel algorithm for the challenging problem of multi-dimensional numerical integration on massively parallel architectures. Adaptive algorithms have demonstrated the best performance, but efficient many-core…
The focus of my PhD thesis is on exploring parallel approaches to efficiently solve problems modeled by constraints and presenting a new proposal. Current solvers are very advanced; they are carefully designed to effectively manage the…
The Nvidia GPU architecture has introduced new computing elements such as the \textit{tensor cores}, which are special processing units dedicated to perform fast matrix-multiply-accumulate (MMA) operations and accelerate \textit{Deep…
GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…
The number of cores on graphical computing units (GPUs) is reaching thousands nowadays, whereas the clock speed of processors stagnates. Unfortunately, constraint programming solvers do not take advantage yet of GPU parallelism. One reason…
Matlab is very widely used in scientific computing, but Matlab computational efficiency is lower than C language program. In order to improve the computing speed, some toolbox can use GPU to accelerate the computation. This paper describes…
The main objective of this work consists in analyzing sub-structuring method for the parallel solution of sparse linear systems with matrices arising from the discretization of partial differential equations such as finite element, finite…
Many techniques in program synthesis, superoptimization, and array programming require parallel rollouts of general-purpose programs. GPUs, while capable targets for domain-specific parallelism, are traditionally underutilized by such…
Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…
Graphics Processing Units (GPUs) are high performance co-processors originally intended to improve the use and quality of computer graphics applications. Once, researchers and practitioners noticed the potential of using GPU for general…
GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and…
We advocate the Loop-of-stencil-reduce pattern as a means of simplifying the implementation of data-parallel programs on heterogeneous multi-core platforms. Loop-of-stencil-reduce is general enough to subsume map, reduce, map-reduce,…
In recent years, mesh subdivision---the process of forging smooth free-form surfaces from coarse polygonal meshes---has become an indispensable production instrument. Although subdivision performance is crucial during simulation, animation…
Parallel computing can offer an enormous advantage regarding the performance for very large applications in almost any field: scientific computing, computer vision, databases, data mining, and economics. GPUs are high performance many-core…
This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running…
In this work, we present an extension of Gaussian process (GP) models with sophisticated parallelization and GPU acceleration. The parallelization scheme arises naturally from the modular computational structure w.r.t. datapoints in the…
Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted…
Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic…
Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…
Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, AllReduce algorithms are delayed by the…