Related papers: GPU Implementations for Midsize Integer Addition a…

A GPU Based Memory Optimized Parallel Method For FFT Implementation

FFT (fast Fourier transform) plays a very important role in many fields, such as digital signal processing and digital image processing. However, in practice, FFT becomes a factor that limits processing efficiency, especially…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-07-25 Fan Zhang, Chen Hu, Qiang Yin, Wei Hu

Improving the performance of the linear systems solvers using CUDA

Parallel computing can offer an enormous advantage regarding the performance for very large applications in almost any field: scientific computing, computer vision, databases, data mining, and economics. GPUs are high performance many-core…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-11-24 Bogdan Oancea, Tudorel Andrei, Raluca Mariana Dragoescu

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari, Mudabir Qamar Ansari

PAGANI: A Parallel Adaptive GPU Algorithm for Numerical Integration

We present a new adaptive parallel algorithm for the challenging problem of multi-dimensional numerical integration on massively parallel architectures. Adaptive algorithms have demonstrated the best performance, but efficient many-core…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-24 Ioannis Sakiotis, Kamesh Arumugam, Marc Paterno, Desh Ranjan, Balša Terzić, Mohammad Zubair

Faster and Cheaper: Parallelizing Large-Scale Matrix Factorization on GPUs

Matrix factorization (MF) is employed by many popular algorithms, e.g., collaborative filtering. The emerging GPU technology, with massively multicore and high intra-chip memory bandwidth but limited memory capacity, presents an opportunity…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-25 Wei Tan, Liangliang Cao, Liana Fong

GPU-accelerated factorization sets in numerical semigroups via parallel bounded lexicographic streams

We describe a method for parallelizing the lexicographic enumeration algorithm for the factorization set of an element in a numerical semigroup via bounds. This enables the use of GPU and distributed computing methods. We provide a CUDA…

Commutative Algebra · Mathematics 2024-05-14 Thomas Barron

GPGPU Processing in CUDA Architecture

The future of computation is the Graphical Processing Unit, i.e. the GPU. The promise that the graphics cards have shown in the field of image processing and accelerated rendering of 3D scenes, and the computational capability that these…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-02-21 Jayshree Ghorpade, Jitendra Parande, Madhura Kulkarni, Amit Bawaskar

GPU-Accelerated Primal Heuristics for Mixed Integer Programming

We introduce a fusion of GPU accelerated primal heuristics for Mixed Integer Programming. Leveraging GPU acceleration enables exploration of larger search regions and faster iterations. A GPU-accelerated PDLP serves as an approximate LP…

Optimization and Control · Mathematics 2025-10-31 Akif Çördük, Piotr Sielski, Alice Boucher, Kumar Aatish

Graphics Processing Units and High-Dimensional Optimization

This paper discusses the potential of graphics processing units (GPUs) in high-dimensional optimization problems. A single GPU card with hundreds of arithmetic cores can be inserted in a personal computer and dramatically accelerates many…

Computation · Statistics 2015-03-13 Hua Zhou, Kenneth Lange, Marc A. Suchard

Tiling for Performance Tuning on Different Models of GPUs

The strategy of using CUDA-compatible GPUs as a parallel computation solution to improve the performance of programs has been more and more widely approved during the last two years since the CUDA platform was released. Its benefit extends…

Distributed, Parallel, and Cluster Computing · Computer Science 2010-01-12 Chang Xu, Steven R. Kirk, Samantha Jenkins

Mixed precision in Graphics Processing Unit

Modern graphics computing units (GPUs) are designed and optimized to perform highly parallel numerical calculations. This parallelism has enabled (and promises) significant advantages, both in terms of energy performance and calculation. In…

Hardware Architecture · Computer Science 2021-10-26 Quentin Gallouédec

Simultaneous Solving of Batched Linear Programs on a GPU

Linear Programs (LPs) appear in a large number of applications and offloading them to a GPU is viable to gain performance. Existing work on offloading and solving an LP on a GPU suggests that there is performance gain generally on large…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-26 Amit Gurung, Rajarshi Ray

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The…

Mathematical Software · Computer Science 2013-04-29 Chetan Jhurani, Paul Mullowney

A Preliminary Study on Accelerating Simulation Optimization with GPU Implementation

We provide a preliminary study on utilizing GPU (Graphics Processing Unit) to accelerate computation for three simulation optimization tasks with either first-order or second-order algorithms. Compared to the implementation using only CPU…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-19 Jinghai He, Haoyu Liu, Yuhang Wu, Zeyu Zheng, Tingyu Zhu

Extensions and Limitations of the Neural GPU

The Neural GPU is a recent model that can learn algorithms such as multi-digit binary addition and binary multiplication in a way that generalizes to inputs of arbitrary length. We show that there are two simple ways of improving the…

Neural and Evolutionary Computing · Computer Science 2016-11-08 Eric Price, Wojciech Zaremba, Ilya Sutskever

GPU accelerated matrix factorization of large scale data using block based approach

Matrix Factorization (MF) on large-scale data takes substantial time on a Central Processing Unit (CPU). While Graphical Processing Units (GPUs) could expedite the computation of MF, the available memory on a GPU is finite. Leveraging GPUs…

Machine Learning · Computer Science 2023-04-28 Prasad Bhavana, Vineet Padmanabhan

Enabling predictable parallelism in single-GPU systems with persistent CUDA threads

Graphics Processing Units, or GPUs, have been successfully adopted both for graphics computation in 3D applications and for general-purpose applications (GP-GPUs), thanks to their tremendous performance-per-watt. Recently, there is a big…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-10-03 Paolo Burgio

Numerical integration on GPUs for higher order finite elements

The paper considers the problem of implementation on graphics processors of numerical integration routines for higher order finite element approximations. The design of suitable GPU kernels is investigated in the context of general purpose…

Mathematical Software · Computer Science 2014-03-03 Krzysztof Banaś, Przemysław Płaszewski, Paweł Macioł

Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and Analytical Model-driven Tuning Methodologies

GPU-embedded systems have gained popularity across various domains due to their efficient power consumption. However, in order to meet the demands of real-time or time-consuming applications running on these systems, it is crucial for them…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-11-17 Adrian Perez Dieguez, Margarita Amor Lopez

A Fast and Generic GPU-Based Parallel Reduction Implementation

Reduction operations are extensively employed in many computational problems. Given a finite set of numeric elements, a reduction combines all elements in that set into a single value using a combiner function. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-23 Walid Jradi, Hugo do Nascimento, Wellington Martins