Related papers: Lightweight Correlation-Aware Table Compression
Column encoding schemes have witnessed a spark of interest with the rise of open storage formats (like Parquet) in data lakes in modern cloud deployments. This is not surprising -- as data volume increases, it becomes more and more…
Cache-aided coded multicast leverages side information at wireless edge caches to efficiently serve multiple groupcast demands via common multicast transmissions, leading to load reductions that are proportional to the aggregate cache size.…
Cropping high-resolution document images into multiple sub-images is the most widely used approach for current Multimodal Large Language Models (MLLMs) to perform document understanding. Most current document understanding methods preserve…
This paper addresses the problem of correlation estimation in sets of compressed images. We consider a framework where images are represented in the form of linear measurements due to low-complexity sensing or security requirements. We…
The communication bottleneck in federated learning (FL) has spurred extensive research into techniques to reduce the volume of data exchanged between client devices and the central parameter server. In this paper, we systematically classify…
Modern data analytics applications prefer to use column-storage formats due to their improved storage efficiency through encoding and compression. Parquet is the most popular file format for column data storage that provides several of…
We present a filter correlation based model compression approach for deep convolutional neural networks. Our approach iteratively identifies pairs of filters with the largest pairwise correlations and drops one of the filters from each such…
The construction of highly coherent x-ray sources has enabled new research opportunities across the scientific landscape. The maximum raw data rate per beamline now exceeds 40 GB/s, posing unprecedented challenges for the online processing…
Compressing the KV cache is a required step to deploy large language models on edge devices. Current quantization methods compress storage but fail to reduce bandwidth as attention calculation requires dequantizing keys from INT4/INT8 to…
Soft context compression reduces the computational workload of processing long contexts in LLMs by encoding long context into a smaller number of latent tokens. However, existing frameworks apply uniform compression ratios, failing to…
Driven by significant improvements in architectural design and training pipelines, computer vision has recently experienced dramatic progress in terms of accuracy on classic benchmarks such as ImageNet. These highly-accurate models are…
Time series data from a variety of sensors and IoT devices need effective compression to reduce storage and I/O bandwidth requirements. While most time series databases and systems rely on lossless compression, lossy techniques offer even…
The deployment of modern network applications is increasing network size and traffic volumes at an unprecedented pace. Storing network-related information (e.g., traffic traces) is key to enabling efficient network management. However,…
Data lakes, increasingly adopted for their ability to store and analyze diverse types of data, commonly use columnar storage formats like Parquet and ORC for handling relational tables. However, these traditional setups fall short when it…
Nowadays, data is represented by vectors. Retrieving those vectors, among millions and billions, that are similar to a given query is a ubiquitous problem, known as similarity search, of relevance for a wide range of applications.…
Lightweight data compression is a key technique that allows column stores to exhibit superior performance for analytical queries. Despite a comprehensive study on dictionary-based encodings to approach Shannon's entropy, few prior works…
A method is presented to automatically generate context models of data by calculating the data's autocorrelation function. The largest values of the autocorrelation function occur at the offsets or lags in the bitstream which tend to be the…
The proliferation of small files in data lakes poses significant challenges, including degraded query performance, increased storage costs, and scalability bottlenecks in distributed storage systems. Log-structured table formats (LSTs) such…
Increasing amounts of structured data can provide value for research and business if the relevant data can be located. Often the data is in a data lake without a consistent schema, making locating useful data challenging. Table search is a…
Cache-aided coded multicast leverages side information at wireless edge caches to efficiently serve multiple unicast demands via common multicast transmissions, leading to load reductions that are proportional to the aggregate cache size.…