English
Related papers

Related papers: A Fast and Generic GPU-Based Parallel Reduction Im…

200 papers

We present a new adaptive parallel algorithm for the challenging problem of multi-dimensional numerical integration on massively parallel architectures. Adaptive algorithms have demonstrated the best performance, but efficient many-core…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-24 Ioannis Sakiotis , Kamesh Arumugam , Marc Paterno , Desh Ranjan , Balša Terzić , Mohammad Zubair

The focus of my PhD thesis is on exploring parallel approaches to efficiently solve problems modeled by constraints and presenting a new proposal. Current solvers are very advanced; they are carefully designed to effectively manage the…

Artificial Intelligence · Computer Science 2019-09-23 Fabio Tardivo

The Nvidia GPU architecture has introduced new computing elements such as the \textit{tensor cores}, which are special processing units dedicated to perform fast matrix-multiply-accumulate (MMA) operations and accelerate \textit{Deep…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-12 Roberto Carrasco , Raimundo Vega , Cristóbal A. Navarro

GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Hancheng Wu , Da Li , Michela Becchi

The number of cores on graphical computing units (GPUs) is reaching thousands nowadays, whereas the clock speed of processors stagnates. Unfortunately, constraint programming solvers do not take advantage yet of GPU parallelism. One reason…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-26 Pierre Talbot , Frédéric Pinel , Pascal Bouvry

Matlab is very widely used in scientific computing, but Matlab computational efficiency is lower than C language program. In order to improve the computing speed, some toolbox can use GPU to accelerate the computation. This paper describes…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-26 Mingzhe Wang , Bo Wang , Qiu He , Xiuxiu Liu , Kunshuai Zhu

The main objective of this work consists in analyzing sub-structuring method for the parallel solution of sparse linear systems with matrices arising from the discretization of partial differential equations such as finite element, finite…

Numerical Analysis · Mathematics 2021-08-31 Abal-Kassim Cheik Ahamed , Frédéric Magoulès

Many techniques in program synthesis, superoptimization, and array programming require parallel rollouts of general-purpose programs. GPUs, while capable targets for domain-specific parallelism, are traditionally underutilized by such…

Programming Languages · Computer Science 2026-04-15 Breandan Considine

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari , Mudabir Qamar Ansari

Graphics Processing Units (GPUs) are high performance co-processors originally intended to improve the use and quality of computer graphics applications. Once, researchers and practitioners noticed the potential of using GPU for general…

Numerical Analysis · Computer Science 2016-07-12 K. Parand , Saeed Zafarvahedian , Sayyed A. Hossayni

GPUs have significantly accelerated first-order methods for large-scale optimization, especially in continuous optimization. However, this success has not transferred cleanly to problems with discrete variables, combinatorial structure, and…

Machine Learning · Computer Science 2026-05-22 Jiachang Liu , Andrea Lodi

We advocate the Loop-of-stencil-reduce pattern as a means of simplifying the implementation of data-parallel programs on heterogeneous multi-core platforms. Loop-of-stencil-reduce is general enough to subsume map, reduce, map-reduce,…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-09-16 M. Aldinucci , M. Danelutto , M. Drocco , P. Kilpatrick , C. Misale , G. Peretti Pezzi , M. Torquati

In recent years, mesh subdivision---the process of forging smooth free-form surfaces from coarse polygonal meshes---has become an indispensable production instrument. Although subdivision performance is crucial during simulation, animation…

Graphics · Computer Science 2019-01-17 Daniel Mlakar , Martin Winter , Hans-Peter Seidel , Markus Steinberger , Rhaleb Zayer

Parallel computing can offer an enormous advantage regarding the performance for very large applications in almost any field: scientific computing, computer vision, databases, data mining, and economics. GPUs are high performance many-core…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-24 Bogdan Oancea , Tudorel Andrei , Raluca Mariana Dragoescu

This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-17 Cristóbal A. Navarro , Roberto Carrasco , Ricardo J. Barrientos , Javier A. Riquelme , Raimundo Vega

In this work, we present an extension of Gaussian process (GP) models with sophisticated parallelization and GPU acceleration. The parallelization scheme arises naturally from the modular computational structure w.r.t. datapoints in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-10-21 Zhenwen Dai , Andreas Damianou , James Hensman , Neil Lawrence

Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-11 Mhd Ghaith Olabi , Juan Gómez Luna , Onur Mutlu , Wen-mei Hwu , Izzat El Hajj

Real-time trajectory optimization for nonlinear constrained autonomous systems is critical and typically performed by CPU-based sequential solvers. Specifically, reliance on global sparse linear algebra or the serial nature of dynamic…

Robotics · Computer Science 2026-03-13 Yilin Zou , Zhong Zhang , Maxime Robic , Fanghua Jiang

Parallel computing is a standard approach to achieving high-performance computing (HPC). Three commonly used methods to implement parallel computing include: 1) applying multithreading technology on single-core or multi-core CPUs; 2)…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-09-18 Xinyao Yi

Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, AllReduce algorithms are delayed by the…

Machine Learning · Computer Science 2025-09-30 Arjun Devraj , Eric Ding , Abhishek Vijaya Kumar , Robert Kleinberg , Rachee Singh
‹ Prev 1 2 3 10 Next ›