Related papers: Trie Compression for GPU Accelerated Multi-Pattern…

A Highly-Efficient Memory-Compression Scheme for GPU-Accelerated Intrusion Detection Systems

Pattern Matching is a computationally intensive task used in many research fields and real world applications. Due to the ever-growing volume of data to be processed, and increasing link speeds, the number of patterns to be matched has…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-04-10 Xavier Bellekens, Christos Tachtatzis, Robert Atkinson, Craig Renfrew, Tony Kirkham

Graph Compression Using Pattern Matching Techniques

Graphs can be used to represent a wide variety of data belonging to different domains. Graphs can capture the relationship among data in an efficient way, and have been widely used. In recent times, with the advent of Big Data, there has…

Data Structures and Algorithms · Computer Science 2018-06-06 Rushabh Jitendrakumar Shah

To Use or Not to Use: Graphics Processing Units for Pattern Matching Algorithms

String matching is an important part of today's computer applications, and the Aho-Corasick algorithm is one of the main string matching algorithms used to accomplish it. This paper discusses when GPUs can be used for string matching…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-12-30 Vajira Thambawita, Roshan Ragel, Dhammika Elkaduwe

A Hybrid Parallel Implementation of the Aho-Corasick and Wu-Manber Algorithms Using NVIDIA CUDA and MPI Evaluated on a Biological Sequence Database

Multiple matching algorithms are used to locate the occurrences of patterns from a finite pattern set in a large input string. Aho-Corasick and Wu-Manber, two of the best-known algorithms for multiple matching, require an increased…

Distributed, Parallel, and Cluster Computing · Computer Science 2014-07-11 Charalampos S. Kouzinopoulos, John-Alexander M. Assael, Themistoklis K. Pyrgiotis, Konstantinos G. Margaritis

GPUs as Storage System Accelerators

Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, an order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, as any…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Samer Al-Kiswany, Abdullah Gharaibeh, Matei Ripeanu

A High-Throughput GPU Framework for Adaptive Lossless Compression of Floating-Point Data

The torrential influx of floating-point data from domains like IoT and HPC necessitates high-performance lossless compression to mitigate storage costs while preserving absolute data fidelity. Leveraging GPU parallelism for this task…

Databases · Computer Science 2025-11-12 Zheng Li, Weiyan Wang, Ruiyuan Li, Chao Chen, Xianlei Long, Linjiang Zheng, Quanqing Xu, Chuanhui Yang

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-30 Mufakir Qamar Ansari, Mudabir Qamar Ansari

Improved Parallel Rabin-Karp Algorithm Using Compute Unified Device Architecture

String matching algorithms are among the most widely used algorithms in computer science. Improving the efficiency of the underlying string matching algorithm will greatly increase the efficiency of any…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-10-03 Parth Shah, Rachana Oza

A Graph-based Model for GPU Caching Problems

Modeling data sharing in GPU programs is a challenging task because of the massive parallelism and complex data sharing patterns provided by GPU architectures. Better GPU caching efficiency can be achieved through careful task scheduling…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-10-04 Lingda Li, Ari B. Hayes, Stephen A. Hackler, Eddy Z. Zhang, Mario Szegedy, Shuaiwen Leon Song

Massively-Parallel Lossless Data Decompression

Today's exponentially increasing data volumes and the high cost of storage make compression essential for the Big Data industry. Although research has concentrated on efficient compression, fast decompression is critical for analytics…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-03 Evangelia Sitaridi, Rene Mueller, Tim Kaldewey, Guy Lohman, Kenneth Ross

Parallel Data Compression Techniques

With endless amounts of data and very limited bandwidth, fast data compression is one solution for the growing data-sharing problem. Compression helps lower transfer times and save memory, but if the compression takes too long, this no…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-06-21 David Noel, Elizabeth Graham, Liyuan Liu

A Fast and Generic GPU-Based Parallel Reduction Implementation

Reduction operations are extensively employed in many computational problems. A reduction consists of, given a finite set of numeric elements, combining into a single value all elements in that set, using for this a combiner function. A…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-23 Walid Jradi, Hugo do Nascimento, Wellington Martins

gMatch: Fine-Grained and Hardware-Efficient Subgraph Matching on GPUs

Subgraph matching is a core operation in graph analytics, supporting a broad spectrum of applications from social network analysis to bioinformatics. Recent GPU-based approaches accelerate subgraph matching by leveraging parallelism but…

Databases · Computer Science 2026-04-14 Weitian Chen, Shixuan Sun, Cheng Chen, Yongmin Hu, Yingqian Hu, Minyi Guo

Accelerating Concurrent Heap on GPUs

Priority queue, often implemented as a heap, is an abstract data type that has been used in many well-known applications like Dijkstra's shortest path algorithm, Prim's minimum spanning tree, Huffman encoding, and the branch-and-bound…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-18 Yanhao Chen, Fei Hua, Chaozhang Huang, Jeremy Bierema, Chi Zhang, Eddy Z. Zhang

CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs

Data compression and decompression have become vital components of big-data applications to manage the exponential growth in the amount of data collected and stored. Furthermore, big-data applications have increasingly adopted GPUs due to…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-07-11 Jeongmin Park, Zaid Qureshi, Vikram Mailthody, Andrew Gacek, Shunfan Shao, Mohammad AlMasri, Isaac Gelado, Jinjun Xiong, Chris Newburn, I-hsin Chung, Michael Garland, Nikolay Sakharnykh, Wen-mei Hwu

Optimizing Bloom Filters for Modern GPU Architectures

Bloom filters are a fundamental data structure for approximate membership queries, with applications ranging from data analytics to databases and genomics. Several variants have been proposed to accommodate parallel architectures. GPUs,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-18 Daniel Jünger, Kevin Kristensen, Yunsong Wang, Xiangyao Yu, Bertil Schmidt

GPU-Accelerated Algorithms for Process Mapping

Process mapping asks to assign vertices of a task graph to processing elements of a supercomputer such that the computational workload is balanced while the communication cost is minimized. Motivated by the recent success of GPU-based graph…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-16 Petr Samoldekin, Christian Schulz, Henning Woydt

Highly Parallel Sparse Matrix-Matrix Multiplication

Generalized sparse matrix-matrix multiplication is a key primitive for many high performance graph algorithms as well as some linear solvers such as multigrid. We present the first parallel algorithms that achieve increasing speedups for an…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-08-09 Aydın Buluç, John R. Gilbert

Does compressing activations help model parallel training?

Large-scale Transformer models are known for their exceptional performance in a range of tasks, but training them can be difficult due to the requirement for communication-intensive model parallelism. One way to improve training speed is to…

Machine Learning · Computer Science 2023-01-09 Song Bian, Dacheng Li, Hongyi Wang, Eric P. Xing, Shivaram Venkataraman

Design Principles for Sparse Matrix Multiplication on the GPU

We implement two novel algorithms for sparse-matrix dense-matrix multiplication (SpMM) on the GPU. Our algorithms expect the sparse input in the popular compressed-sparse-row (CSR) format and thus do not require expensive format conversion.…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-13 Carl Yang, Aydin Buluc, John D. Owens