Related papers: Corra: Correlation-Aware Column Compression
The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through…
In-memory columnar databases have become mainstream over the last decade and have vastly improved the processing of large volumes of data through multi-core parallelism and in-memory compression, thereby eliminating the usual…
Cache-aided coded multicast leverages side information at wireless edge caches to efficiently serve multiple groupcast demands via common multicast transmissions, leading to load reductions that are proportional to the aggregate cache size.…
Data warehouses organize data in a columnar format to enable faster scans and better compression. Modern systems offer a variety of column encodings that can reduce storage footprint and improve query performance. Selecting a good encoding…
Data compression is widely used in contemporary column-oriented DBMSes to lower space usage and to speed up query processing. Pioneering systems have introduced compression to tackle the disk bandwidth bottleneck by trading CPU processing…
Modern data analytics applications prefer to use column-storage formats due to their improved storage efficiency through encoding and compression. Parquet is the most popular file format for column data storage that provides several of…
Lightweight data compression is a key technique that allows column stores to exhibit superior performance for analytical queries. Despite a comprehensive study on dictionary-based encodings to approach Shannon's entropy, few prior works…
Dataset Condensation (DC) aims to obtain a condensed dataset that allows models trained on the condensed dataset to achieve performance comparable to those trained on the full dataset. Recent DC approaches increasingly focus on encoding…
Cache-aided coded multicast leverages side information at wireless edge caches to efficiently serve multiple unicast demands via common multicast transmissions, leading to load reductions that are proportional to the aggregate cache size.…
Coded caching and delivery is studied taking into account the correlations among the contents in the library. Correlations are modeled as common parts shared by multiple contents; that is, each file in the database is composed of a group of…
Motivated by applications of distributed storage systems to cloud-based key-value stores, the multi-version coding problem has been recently formulated to efficiently store frequently updated data in asynchronous decentralized storage…
Distributed high dimensional mean estimation is a common aggregation routine used often in distributed optimization methods. Most of these applications call for a communication-constrained setting where vectors, whose mean is to be…
Motivated by applications of distributed storage systems to key-value stores, the multi-version coding problem was formulated to efficiently store frequently updated data in asynchronous decentralized storage systems. Inspired by…
Compressed Sparse Column (CSC) and Coordinate (COO) are popular compression formats for sparse matrices. However, both CSC and COO are general purpose and cannot take advantage of any of the properties of the data other than sparsity, such…
With endless amounts of data and very limited bandwidth, fast data compression is one solution for the growing data-sharing problem. Compression helps lower transfer times and save memory, but if the compression takes too long, this no…
This article proposes a novel iterative algorithm based on Low Density Parity Check (LDPC) codes for compression of correlated sources at rates approaching the Slepian-Wolf bound. The setup considered in the article looks at the problem of…
Layered decoding is well appreciated in Low-Density Parity-Check (LDPC) decoder implementation since it can effectively achieve high decoding throughput with low computational complexity. This work, for the first time, addresses low…
Cropping high-resolution document images into multiple sub-images is the most widely used approach for current Multimodal Large Language Models (MLLMs) to perform document understanding. Most current document understanding methods preserve…
Traditional video compression technologies have been developed over decades in pursuit of higher coding efficiency. Efficient temporal information representation plays a key role in video coding. Thus, in this paper, we propose to exploit…
In order to accommodate the ever-growing data from various, possibly independent, sources and the dynamic nature of data usage rates in practical applications, modern cloud data storage systems are required to be scalable, flexible, and…