Compressing Tabular Data via Latent Variable Estimation
Abstract
Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: (i) estimate latent variables associated with rows and columns; (ii) partition the table into blocks according to the row/column latents; (iii) apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; (iv) append a compressed encoding of the latents. We evaluate these algorithms on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well-defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state encoders do not achieve this rate. On the other hand, the latent estimation strategy outlined above achieves the optimal rate.
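The four steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`majority_labels`, `make_table`, `encode`, `decode`) are hypothetical, the latent estimator is a crude stand-in that labels each row or column by its most frequent symbol, and zlib's DEFLATE (an LZ77-based coder) stands in for the sequential coder. Symbols are assumed to be integers in 0..255 so that they fit in one byte.

```python
import random
import zlib
from collections import Counter


def majority_labels(lines):
    """Crude stand-in estimator (an assumption, not the paper's method):
    label each row/column by its most frequent symbol."""
    return [Counter(line).most_common(1)[0][0] for line in lines]


def make_table(n_rows, n_cols, k, seed=0):
    """Toy latent-variable table: independent row/column latents,
    entries conditionally independent given the latents."""
    rng = random.Random(seed)
    u = [rng.randrange(k) for _ in range(n_rows)]
    v = [rng.randrange(k) for _ in range(n_cols)]
    return [
        [(u[i] * k + v[j]) % 256 if rng.random() < 0.9 else rng.randrange(256)
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


def encode(table):
    n_rows, n_cols = len(table), len(table[0])
    # Step 1: estimate latents for rows and columns.
    row_lat = majority_labels(table)
    col_lat = majority_labels(list(zip(*table)))
    # Step 2: partition entries into blocks keyed by (row latent, column latent),
    # scanning the table in row-major order.
    blocks = {}
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            blocks.setdefault((row_lat[i], col_lat[j]), bytearray()).append(x)
    # Step 3: apply a sequential coder to each block separately.
    payload = {key: zlib.compress(bytes(buf), 9) for key, buf in blocks.items()}
    # Step 4: append a compressed encoding of the latents.
    side = zlib.compress(bytes(row_lat) + bytes(col_lat), 9)
    return n_rows, n_cols, payload, side


def decode(n_rows, n_cols, payload, side):
    lat = zlib.decompress(side)
    row_lat, col_lat = lat[:n_rows], lat[n_rows:]
    # Replay the same row-major scan, pulling each entry from its block stream.
    streams = {key: iter(zlib.decompress(buf)) for key, buf in payload.items()}
    return [
        [next(streams[(row_lat[i], col_lat[j])]) for j in range(n_cols)]
        for i in range(n_rows)
    ]


table = make_table(60, 40, k=3)
assert decode(*encode(table)) == table  # the round trip is lossless
```

The round trip is lossless for any choice of labels, since the decoder recovers the latents from the side information before reading the blocks. How much is gained over compressing the whole table at once depends on how well the estimated latents match the true ones, which this stand-in estimator does not guarantee.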
Cite
@article{arxiv.2302.09780,
title = {Compressing Tabular Data via Latent Variable Estimation},
author = {Andrea Montanari and Eric Weiner},
journal = {arXiv preprint arXiv:2302.09780},
year = {2023}
}
Comments
45 pages; 6 pdf figures