Compressing Tabular Data via Latent Variable Estimation

Information Theory 2023-02-21 v1 math.IT

Abstract

Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: (i) Estimate latent variables associated with rows and columns; (ii) Partition the table into blocks according to the row/column latents; (iii) Apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; (iv) Append a compressed encoding of the latents. We evaluate these algorithms on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well-defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state encoders do not achieve this rate. On the other hand, the latent estimation strategy outlined above achieves the optimal rate.
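The four steps in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It makes several assumptions: table entries are integers in 0..255, the latent estimator `estimate_latents` is a crude hypothetical stand-in (it labels each row or column by its most frequent symbol), `zlib` (DEFLATE, which is LZ77-based) stands in for the sequential coder, and the returned tuple is an illustrative container rather than a real file format. The function names `compress` and `decompress` are likewise hypothetical.

```python
import zlib
from collections import Counter


def estimate_latents(lines, k):
    """Step (i), crude stand-in: label each line (row or column) by its
    most frequent symbol, reduced modulo k. Not the paper's estimator."""
    return [Counter(line).most_common(1)[0][0] % k for line in lines]


def compress(table, k_rows=2, k_cols=2):
    """Compress a rectangular table of ints in 0..255 (assumed)."""
    m, n = len(table), len(table[0])
    u = estimate_latents(table, k_rows)              # row latents
    v = estimate_latents(list(zip(*table)), k_cols)  # column latents
    blocks = []
    for a in range(k_rows):
        for b in range(k_cols):
            # Step (ii): collect entries whose (row, column) latents are (a, b),
            # scanned in row-major order.
            data = bytes(table[i][j]
                         for i in range(m) if u[i] == a
                         for j in range(n) if v[j] == b)
            # Step (iii): apply a sequential coder to each block separately.
            blocks.append(zlib.compress(data, 9))
    # Step (iv): append a compressed encoding of the latents.
    latents = zlib.compress(bytes(u + v), 9)
    return (m, n, k_rows, k_cols, blocks, latents)


def decompress(payload):
    """Invert compress(): decode the latents first, then refill each block."""
    m, n, k_rows, k_cols, blocks, latents = payload
    uv = zlib.decompress(latents)
    u, v = uv[:m], uv[m:]
    table = [[0] * n for _ in range(m)]
    block_iter = iter(blocks)
    for a in range(k_rows):
        for b in range(k_cols):
            data = iter(zlib.decompress(next(block_iter)))
            for i in range(m):
                if u[i] != a:
                    continue
                for j in range(n):
                    if v[j] == b:
                        table[i][j] = next(data)
    return table
```

Because the decoder recovers the latents before touching any block, it can replay the same row-major scan used by the encoder, which is what makes the scheme lossless regardless of how good the latent estimates are. The quality of the estimates only affects the compressed size.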

Cite

@article{arxiv.2302.09780,
  title  = {Compressing Tabular Data via Latent Variable Estimation},
  author = {Andrea Montanari and Eric Weiner},
  journal= {arXiv preprint arXiv:2302.09780},
  year   = {2023}
}

Comments

45 pages; 6 pdf figures