Related papers: High-Throughput Parallel Viterbi Decoder on GPU Te…

High-Throughput and Memory-Efficient Parallel Viterbi Decoder for Convolutional Codes on GPU

This paper describes a parallel implementation of the Viterbi decoding algorithm. The Viterbi decoder is widely used in many state-of-the-art wireless systems. The proposed solution optimizes both throughput and memory usage by applying…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-11-19 Alireza Mohammadidoost, Matin Hashemi

Analyzing GPU Tensor Core Potential for Fast Reductions

The Nvidia GPU architecture has introduced new computing elements such as the tensor cores, which are special processing units dedicated to perform fast matrix-multiply-accumulate (MMA) operations and accelerate Deep…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-03-12 Roberto Carrasco, Raimundo Vega, Cristóbal A. Navarro

A Gb/s Parallel Block-based Viterbi Decoder for Convolutional Codes on GPU

In this paper, we propose a parallel block-based Viterbi decoder (PBVD) on the graphic processing unit (GPU) platform for the decoding of convolutional codes. The decoding procedure is simplified and parallelized, and the characteristic of…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-02 Hao Peng, Rongke Liu, Yi Hou, Ling Zhao

Generating coupled cluster code for modern distributed memory tensor software

Using GPU-based HPC platforms efficiently for coupled cluster computations is a challenge due to heterogeneous hardware structures. The constant need to adapt software to these structures and the required man-hours makes a systematization…

Chemical Physics · Physics 2025-10-07 Jan Brandejs, Johann Pototschnig, Trond Saue

Parallel Weighted Model Counting with Tensor Networks

A promising new algebraic approach to weighted model counting makes use of tensor networks, following a reduction from weighted model counting to tensor-network contraction. Prior work has focused on analyzing the single-core performance of…

Data Structures and Algorithms · Computer Science 2021-06-16 Jeffrey M. Dudek, Moshe Y. Vardi

GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition

We present an optimized weighted finite-state transducer (WFST) decoder capable of online streaming and offline batch processing of audio using Graphics Processing Units (GPUs). The decoder is efficient in memory utilization, input/output…

Computation and Language · Computer Science 2020-02-17 Hugo Braun, Justin Luitjens, Ryan Leary, Tim Kaldewey, Daniel Povey

The Tensor-Core Beamformer: A High-Speed Signal-Processing Library for Multidisciplinary Use

Beamforming is a well-known technique to combine signals from multiple sensors. It has a wide range of application domains. This paper introduces the Tensor-Core Beamformer: a generic, optimized beamformer library that harnesses the…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-07 Leon Oostrum, Bram Veenboer, Ronald Rook, Michael Brown, Pieter Kruizinga, John W. Romein

Reducing the Computational Cost Scaling of Tensor Network Algorithms via Field-Programmable Gate Array Parallelism

Improving the computational efficiency of quantum many-body calculations from a hardware perspective remains a critical challenge. Although field-programmable gate arrays (FPGAs) have recently been exploited to improve the computational…

Strongly Correlated Electrons · Physics 2026-02-06 Songtai Lv, Yang Liang, Rui Zhu, Qibin Zheng, Haiyuan Zou

GPU Implementation and Optimization of a Flexible MAP Decoder for Synchronization Correction

In this paper we present an optimized parallel implementation of a flexible MAP decoder for synchronization error correcting codes, supporting a very wide range of code sizes and channel conditions. On mid-range GPUs we demonstrate decoding…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-02-26 Johann A. Briffa

Mixed precision in Graphics Processing Unit

Modern graphics computing units (GPUs) are designed and optimized to perform highly parallel numerical calculations. This parallelism has enabled (and promises) significant advantages, both in terms of energy performance and calculation. In…

Hardware Architecture · Computer Science 2021-10-26 Quentin Gallouédec

PAGANI: A Parallel Adaptive GPU Algorithm for Numerical Integration

We present a new adaptive parallel algorithm for the challenging problem of multi-dimensional numerical integration on massively parallel architectures. Adaptive algorithms have demonstrated the best performance, but efficient many-core…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-06-24 Ioannis Sakiotis, Kamesh Arumugam, Marc Paterno, Desh Ranjan, Balša Terzić, Mohammad Zubair

Modeling Deep Learning Accelerator Enabled GPUs

The efficacy of deep learning has resulted in its use in a growing number of applications. The Volta graphics processor unit (GPU) architecture from NVIDIA introduced a specialized functional unit, the "tensor core", that helps meet the…

Mathematical Software · Computer Science 2019-02-22 Md Aamir Raihan, Negar Goli, Tor Aamodt

Efficient ML Decoding for Quantum Convolutional Codes

A novel decoding algorithm is developed for general quantum convolutional codes. Exploiting useful ideas from classical coding theory, the new decoder introduces two innovations that drastically reduce the decoding complexity compared to…

Quantum Physics · Physics 2015-03-13 Peiyu Tan, Jing Li

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors

Tensor Cores have been an important unit to accelerate Fused Matrix Multiplication Accumulation (MMA) in all NVIDIA GPUs since Volta Architecture. To program Tensor Cores, users have to use either legacy wmma APIs or current mma APIs.…

Hardware Architecture · Computer Science 2022-11-29 Wei Sun, Ang Li, Tong Geng, Sander Stuijk, Henk Corporaal

A Variant of Concurrent Constraint Programming on GPU

The number of cores on graphical computing units (GPUs) is reaching thousands nowadays, whereas the clock speed of processors stagnates. Unfortunately, constraint programming solvers do not take advantage yet of GPU parallelism. One reason…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-26 Pierre Talbot, Frédéric Pinel, Pascal Bouvry

On the performance of various parallel GMRES implementations on CPU and GPU clusters

As the need for computational power and efficiency rises, parallel systems become increasingly popular among various scientific fields. While multiple core-based architectures have been the center of attention for many years, the rapid…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-11 E. I. Ioannidis, N. Cheimarios, A. N. Spyropoulos, A. G. Boudouvis

Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

Today's high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during the execution, such that data compression is becoming a critical technique to mitigate…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-02 Jiannan Tian, Cody Rivera, Sheng Di, Jieyang Chen, Xin Liang, Dingwen Tao, Franck Cappello

cuTT: A High-Performance Tensor Transpose Library for CUDA Compatible GPUs

We introduce the CUDA Tensor Transpose (cuTT) library that implements high-performance tensor transposes for NVIDIA GPUs with Kepler and above architectures. cuTT achieves high performance by (a) utilizing two GPU-optimized transpose…

Mathematical Software · Computer Science 2017-05-05 Antti-Pekka Hynninen, Dmitry I. Lyakh

GPU Tensor Cores for fast Arithmetic Reductions

This work proposes a GPU tensor core approach that encodes the arithmetic reduction of $n$ numbers as a set of chained $m \times m$ matrix multiply accumulate (MMA) operations executed in parallel by GPU tensor cores. The asymptotic running…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-01-17 Cristóbal A. Navarro, Roberto Carrasco, Ricardo J. Barrientos, Javier A. Riquelme, Raimundo Vega

VDCores: Resource Decoupled Programming and Execution for Asynchronous GPU

Modern GPUs increasingly rely on specialized and asynchronous hardware units to deliver high performance. Yet these units are often underutilized because today's GPU software stacks still organize programming and execution around a…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-06 Zijian He, Adrian Sampson, Yiying Zhang, Zhiyuan Guo