Compressing Tabular Data via Latent Variable Estimation

Andrea Montanari; Eric Weiner

Information Theory 2023-02-21 v1 math.IT

Abstract

Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: $(i)$ Estimate latent variables associated with rows and columns; $(ii)$ Partition the table into blocks according to the row/column latents; $(iii)$ Apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; $(iv)$ Append a compressed encoding of the latents. We evaluate these algorithms on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well-defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state encoders do not achieve this rate. On the other hand, the latent estimation strategy outlined above achieves the optimal rate.
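
The four steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the latent estimator is a crude stand-in (each row or column is labeled by its most frequent symbol), `zlib` (DEFLATE, an LZ77-based coder) stands in for the sequential coder, entries are integers in 0..255, and the table is non-empty. The function names `estimate_latents`, `compress_table`, and `decompress_table` are hypothetical.

```python
import zlib
from collections import Counter


def estimate_latents(vectors):
    """Step (i), crude stand-in: label each vector by its most frequent symbol."""
    labels = []
    for v in vectors:
        counts = Counter(v)
        # Break ties on the symbol value so the labeling is deterministic.
        labels.append(max(counts, key=lambda s: (counts[s], s)))
    return labels


def compress_table(table):
    """Compress a non-empty list of equal-length rows with entries in 0..255."""
    row_lat = estimate_latents(table)
    col_lat = estimate_latents(list(zip(*table)))

    # Step (ii): route each entry to the block indexed by its
    # (row latent, column latent) pair, scanning the table in row-major order.
    blocks = {}
    for i, row in enumerate(table):
        for j, x in enumerate(row):
            blocks.setdefault((row_lat[i], col_lat[j]), bytearray()).append(x)

    # Step (iii): apply a sequential coder to each block separately.
    coded = {key: zlib.compress(bytes(b), 9) for key, b in blocks.items()}

    # Step (iv): append a compressed encoding of the latents themselves.
    latents = (zlib.compress(bytes(row_lat), 9), zlib.compress(bytes(col_lat), 9))
    return coded, latents


def decompress_table(coded, latents):
    """Invert compress_table by replaying the same row-major scan."""
    row_lat = list(zlib.decompress(latents[0]))
    col_lat = list(zlib.decompress(latents[1]))
    streams = {key: iter(zlib.decompress(c)) for key, c in coded.items()}
    return [[next(streams[(r, c)]) for c in col_lat] for r in row_lat]
```

Because the decoder recovers the latents first and then reads each block in the same row-major order used by the encoder, `decompress_table(*compress_table(table))` returns the original table, which is what makes the scheme lossless regardless of how good the latent estimates are. The quality of the estimates only affects the compressed size.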

Keywords

source coding, image compression, randomized algorithm

Cite

@article{arxiv.2302.09780,
  title  = {Compressing Tabular Data via Latent Variable Estimation},
  author = {Andrea Montanari and Eric Weiner},
  journal= {arXiv preprint arXiv:2302.09780},
  year   = {2023}
}

Comments

45 pages; 6 pdf figures
