
A Very Efficient Scheme for Estimating Entropy of Data Streams Using Compressed Counting

Data Structures and Algorithms 2008-08-21 v2

Abstract

Compressed Counting (CC) was recently proposed for approximating the $\alpha$th frequency moments of data streams, for $0<\alpha\leq 2$. Under the relaxed strict-Turnstile model, CC dramatically improves on the standard algorithm based on symmetric stable random projections, especially as $\alpha\to 1$. A direct application of CC is to estimate the entropy, which is an important summary statistic in Web/network measurement and often serves as a crucial "feature" for data mining. The Rényi entropy and the Tsallis entropy are functions of the $\alpha$th frequency moments, and both approach the Shannon entropy as $\alpha\to 1$. A recent theoretical work suggested using the $\alpha$th frequency moment to approximate the Shannon entropy with $\alpha=1+\delta$ and very small $|\delta|$ (e.g., $<10^{-4}$). In this study, we experiment with using CC to estimate frequency moments, Rényi entropy, Tsallis entropy, and Shannon entropy on real Web crawl data. We demonstrate the variance-bias trade-off in estimating Shannon entropy and provide practical recommendations. In particular, our experiments enable us to draw some important conclusions: (1) As $\alpha\to 1$, CC dramatically improves on symmetric stable random projections in estimating frequency moments, Rényi entropy, Tsallis entropy, and Shannon entropy. The improvements appear to approach "infinity." (2) Using symmetric stable random projections and $\alpha = 1+\delta$ with very small $|\delta|$ does not provide a practical algorithm because the required sample size is enormous.
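To make the relationship between the $\alpha$th frequency moment and the three entropies concrete, the following is a minimal Python sketch. It assumes the standard textbook definitions (with $F_\alpha = \sum_i A_i^\alpha$ and $p_i = A_i/F_1$) and computes everything exactly from a small, made-up count vector. It is not Compressed Counting and involves no random projections or estimation; the function names and the example counts are illustrative assumptions, not taken from the paper.

```python
import math

def frequency_moment(counts, alpha):
    """Exact alpha-th frequency moment F_alpha = sum_i A_i^alpha."""
    return sum(c ** alpha for c in counts if c > 0)

def shannon_entropy(counts):
    """Shannon entropy H = -sum_i p_i log p_i, with p_i = A_i / F_1."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def renyi_entropy(counts, alpha):
    """Renyi entropy H_alpha = log(F_alpha / F_1^alpha) / (1 - alpha), alpha != 1."""
    f1 = frequency_moment(counts, 1.0)
    fa = frequency_moment(counts, alpha)
    return math.log(fa / f1 ** alpha) / (1.0 - alpha)

def tsallis_entropy(counts, alpha):
    """Tsallis entropy T_alpha = (1 - F_alpha / F_1^alpha) / (alpha - 1), alpha != 1."""
    f1 = frequency_moment(counts, 1.0)
    fa = frequency_moment(counts, alpha)
    return (1.0 - fa / f1 ** alpha) / (alpha - 1.0)

# Hypothetical stream summary: item counts after processing the stream.
counts = [5, 3, 1, 1]

if __name__ == "__main__":
    print("Shannon:", shannon_entropy(counts))
    for delta in (0.1, 0.01, 0.001):
        alpha = 1.0 + delta
        print(alpha, renyi_entropy(counts, alpha), tsallis_entropy(counts, alpha))
```

With exact moments, both quantities move toward the Shannon entropy as $\alpha$ nears 1. In the streaming setting $F_\alpha$ is only estimated, and both formulas divide by $\alpha - 1$, so any error in the estimated moment is amplified as $|\delta|$ shrinks; this is the variance-bias trade-off the abstract refers to.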

Cite

@article{arxiv.0808.1771,
  title  = {A Very Efficient Scheme for Estimating Entropy of Data Streams Using Compressed Counting},
  author = {Ping Li},
  journal= {arXiv preprint arXiv:0808.1771},
  year   = {2008}
}