Related papers: CODAG: Characterizing and Optimizing Decompression…

G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression

Text analytics directly on compression (TADOC) has proven to be a promising technology for big data analytics. GPUs are extremely popular accelerators for data analytics systems. Unfortunately, no work so far shows how to utilize GPUs to…

Databases · Computer Science 2021-06-15 Feng Zhang, Zaifeng Pan, Yanliang Zhou, Jidong Zhai, Xipeng Shen, Onur Mutlu, Xiaoyong Du

GPU Acceleration of SQL Analytics on Compressed Data

GPUs are uniquely suited to accelerate (SQL) analytics workloads thanks to their massive compute parallelism and High Bandwidth Memory (HBM) -- when datasets fit in the GPU HBM, performance is unparalleled. Unfortunately, GPU HBMs remain…

Databases · Computer Science 2025-09-05 Zezhou Huang, Krystian Sakowski, Hans Lehnert, Wei Cui, Carlo Curino, Matteo Interlandi, Marius Dumitru, Rathijit Sen

Massively-Parallel Lossless Data Decompression

Today's exponentially increasing data volumes and the high cost of storage make compression essential for the Big Data industry. Although research has concentrated on efficient compression, fast decompression is critical for analytics…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-06-03 Evangelia Sitaridi, Rene Mueller, Tim Kaldewey, Guy Lohman, Kenneth Ross

ZipFlow: a Compiler-based Framework to Unleash Compressed Data Movement for Modern GPUs

In GPU-accelerated data analytics, the overhead of data transfer from CPU to GPU becomes a performance bottleneck when the data scales beyond GPU memory capacity due to the limited PCIe bandwidth. Data compression has come to rescue for…

Databases · Computer Science 2026-02-10 Gwangoo Yeo, Zhiyang Shen, Wei Cui, Matteo Interlandi, Rathijit Sen, Bailu Ding, Qi Chen, Minsoo Rhu

A GPU Register File using Static Data Compression

GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek new approaches to improve their utilization. This paper…

Hardware Architecture · Computer Science 2020-12-10 Alexandra Angerd, Erik Sintorn, Per Stenström

Hiding Latencies in Network-Based Image Loading for Deep Learning

In the last decades, the computational power of GPUs has grown exponentially, allowing current deep learning (DL) applications to handle increasingly large amounts of data at a progressively higher throughput. However, network and storage…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-08 Francesco Versaci, Giovanni Busonera

Accelerating Lossless Data Compression with GPUs

Huffman compression is a statistical, lossless data compression algorithm that compresses data by assigning variable-length codes to symbols, with more frequently appearing symbols given shorter codes than less frequent ones. This work is a…

Information Theory · Computer Science 2011-07-11 R. L. Cloud, M. L. Curry, H. L. Ward, A. Skjellum, P. Bangalore

Accelerating JPEG Decompression on GPUs

The JPEG compression format has been the standard for lossy image compression for multiple decades, offering high compression rates at minor perceptual loss in image quality. For GPU-accelerated computer vision and deep learning tasks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-11-18 André Weißenberger, Bertil Schmidt

Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications

This paper provides an in-depth characterization of GPU-accelerated systems, to understand the interplay between overlapping computation and communication which is commonly employed in distributed training settings. Due to the large size of…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-08 Seonho Lee, Jihwan Oh, Junkyum Kim, Seokjin Go, Jongse Park, Divya Mahajan

Access Pattern-Based Code Compression for Memory-Constrained Embedded Systems

As compared to a large spectrum of performance optimizations, relatively little effort has been dedicated to optimize other aspects of embedded applications such as memory space requirements, power, real-time predictability, and…

Other Computer Science · Computer Science 2011-11-09 O. Ozturk, H. Saputra, M. Kandemir, I. Kolcu

CoEdge-RAG: Optimizing Hierarchical Scheduling for Retrieval-Augmented LLMs in Collaborative Edge Computing

Motivated by the imperative for real-time responsiveness and data privacy preservation, large language models (LLMs) are increasingly deployed on resource-constrained edge devices to enable localized inference. To improve output quality,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-11-11 Guihang Hong, Tao Ouyang, Kongyange Zhao, Zhi Zhou, Xu Chen

Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

In the rapidly advancing field of image generation, Visual Auto-Regressive (VAR) modeling has garnered considerable attention for its innovative next-scale prediction approach. This paradigm offers substantial improvements in efficiency,…

Computer Vision and Pattern Recognition · Computer Science 2024-11-28 Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang

BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing

Graph neural networks (GNNs) have extended the success of deep neural networks (DNNs) to non-Euclidean graph data, achieving ground-breaking performance on various tasks such as node classification and graph property prediction.…

Machine Learning · Computer Science 2021-12-17 Tianfeng Liu, Yangrui Chen, Dan Li, Chuan Wu, Yibo Zhu, Jun He, Yanghua Peng, Hongzheng Chen, Hongzhi Chen, Chuanxiong Guo

Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs

GPUs offer orders-of-magnitude higher memory bandwidth than traditional CPU-only systems. However, GPU device memory tends to be relatively small and the memory capacity cannot be increased by the user. This paper describes Buddy…

Hardware Architecture · Computer Science 2019-04-17 Esha Choukse, Michael Sullivan, Mike O'Connor, Mattan Erez, Jeff Pool, David Nellans, Steve Keckler

HP-MDR: High-performance and Portable Data Refactoring and Progressive Retrieval with Advanced GPUs

Scientific applications produce vast amounts of data, posing grand challenges in the underlying data management and analytic tasks. Progressive compression is a promising way to address this problem, as it allows for on-demand data…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-02 Yanliang Li, Wenbo Li, Qian Gong, Qing Liu, Norbert Podhorszki, Scott Klasky, Xin Liang, Jieyang Chen

A Compiler Framework for Optimizing Dynamic Parallelism on GPUs

Dynamic parallelism on GPUs allows GPU threads to dynamically launch other GPU threads. It is useful in applications with nested parallelism, particularly where the amount of nested parallelism is irregular and cannot be predicted…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-01-11 Mhd Ghaith Olabi, Juan Gómez Luna, Onur Mutlu, Wen-mei Hwu, Izzat El Hajj

RAGE for the Machine: Image Compression with Low-Cost Random Access for Embedded Applications

We introduce RAGE, an image compression framework that achieves four generally conflicting objectives: 1) good compression for a wide variety of color images, 2) computationally efficient, fast decompression, 3) fast random access of images…

Image and Video Processing · Electrical Eng. & Systems 2024-02-12 Christian D. Rask, Daniel E. Lucani

Compiler-Assisted Workload Consolidation For Efficient Dynamic Parallelism on GPU

GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset.…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-11-18 Hancheng Wu, Da Li, Michela Becchi

Trie Compression for GPU Accelerated Multi-Pattern Matching

Graphics Processing Units allow for running massively parallel applications offloading the CPU from computationally intensive resources, however GPUs have a limited amount of memory. In this paper a trie compression algorithm for massively…

Data Structures and Algorithms · Computer Science 2017-02-20 Xavier Bellekens, Amar Seeam, Christos Tachtatzis, Robert Atkinson

Hoard: A Distributed Data Caching System to Accelerate Deep Learning Training on the Cloud

Deep Learning system architects strive to design a balanced system where the computational accelerator -- FPGA, GPU, etc, is not starved for data. Feeding training data fast enough to effectively keep the accelerator utilization high is…

Performance · Computer Science 2018-12-04 Christian Pinto, Yiannis Gkoufas, Andrea Reale, Seetharami Seelam, Steven Eliuk