Related papers: Computationally Efficient Implementation of a Hamm…
In this paper we present an optimized parallel implementation of a flexible MAP decoder for synchronization error correcting codes, supporting a very wide range of code sizes and channel conditions. On mid-range GPUs we demonstrate decoding…
We present a highly parallel implementation of the cross-correlation of time-series data using graphics processing units (GPUs), which is scalable to hundreds of independent inputs and suitable for the processing of signals from "Large-N"…
This paper describes a parallel implementation of Viterbi decoding algorithm. Viterbi decoder is widely used in many state-of-the-art wireless systems. The proposed solution optimizes both throughput and memory usage by applying…
Generation of optimal codes is a well known problem in coding theory. Many computational approaches exist in the literature for finding record breaking codes. However generating codes with long lengths $n$ using serial algorithms is…
We present a new adaptive parallel algorithm for the challenging problem of multi-dimensional numerical integration on massively parallel architectures. Adaptive algorithms have demonstrated the best performance, but efficient many-core…
Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental operation in graph computing and analytics. However, the irregularity of real-world graphs poses significant challenges to achieving efficient SpMM operation for graph data on…
Graphics Processing Unit, or GPUs, have been successfully adopted both for graphic computation in 3D applications, and for general purpose application (GP-GPUs), thank to their tremendous performance-per-watt. Recently, there is a big…
This paper discusses the potential of graphics processing units (GPUs) in high-dimensional optimization problems. A single GPU card with hundreds of arithmetic cores can be inserted in a personal computer and dramatically accelerates many…
Today's high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during the execution, such that data compression is becoming a critical technique to mitigate…
Parsing is essential for a wide range of use cases, such as stream processing, bulk loading, and in-situ querying of raw data. Yet, the compute-intense step often constitutes a major bottleneck in the data ingestion pipeline, since parsing…
The alternating direction method of multipliers (ADMM) is a powerful operator splitting technique for solving structured convex optimization problems. Due to its relatively low per-iteration computational cost and ability to exploit…
The Kernel Polynomial Method (KPM) is one of the fast diagonalization methods used for simulations of quantum systems in research fields of condensed matter physics and chemistry. The algorithm has a difficulty to be parallelized on a…
The Graphics Processing Unit (GPU) is a powerful tool for parallel computing. In the past years the performance and capabilities of GPUs have increased, and the Compute Unified Device Architecture (CUDA) - a parallel computing architecture…
As in various fields like scientific research and industrial application, the computation time optimization is becoming a task that is of increasing importance because of its highly parallel architecture. The graphics processing unit is…
Graphics Processing Units (GPUs) are high performance co-processors originally intended to improve the use and quality of computer graphics applications. Once, researchers and practitioners noticed the potential of using GPU for general…
With high computation power and memory bandwidth, graphics processing units (GPUs) lend themselves to accelerate data-intensive analytics, especially when such applications fit the single instruction multiple data (SIMD) model. However,…
Graphics processing units (GPU) had evolved from a specialized hardware capable to render high quality graphics in games to a commodity hardware for effective processing blocks of data in a parallel schema. This evolution is particularly…
We present our implementation of the RHMC algorithm for staggered fermions on Graphics Processing Units using the NVIDIA CUDA programming language. While previous studies exclusively deal with the Dirac matrix inversion problem, our code…
Experience shows that on today's high performance systems the utilization of different acceleration cards in conjunction with a high utilization of all other parts of the system is difficult. Future architectures, like exascale clusters,…
Parallel computing can offer an enormous advantage regarding the performance for very large applications in almost any field: scientific computing, computer vision, databases, data mining, and economics. GPUs are high performance many-core…