Related papers: A Very Efficient Scheme for Estimating Entropy of …
The Shannon entropy is a widely used summary statistic, for example in network traffic measurement, anomaly detection, neural computations, and spike trains. This study focuses on estimating the Shannon entropy of data streams. It is known that…
Compressed Counting (CC) [22] was recently proposed for estimating the $\alpha$th frequency moments of data streams, where $0 < \alpha \le 2$. CC can be used for estimating Shannon entropy, which can be approximated by certain functions of the $\alpha$th…
Compressed Counting (CC), based on maximally skewed stable random projections, was recently proposed for estimating the p-th frequency moments of data streams. The case p->1 is extremely useful for estimating Shannon entropy of data…
Compressed Counting (CC) was recently proposed for very efficiently computing the (approximate) $\alpha$th frequency moments of data streams, where $0 < \alpha \le 2$. Several estimators were reported including the geometric mean estimator,…
We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is…
Counting is among the most fundamental operations in computing. For example, counting the pth frequency moment has been a very active area of research in theoretical computer science, databases, and data mining. When p=1, the task (i.e.,…
Estimating the p-th frequency moment of a data stream is a very heavily studied problem. The problem is actually trivial when p = 1, assuming the strict Turnstile model. The sample complexity of our proposed algorithm is essentially O(1) near…
Estimation of Shannon and Rényi entropies of unknown discrete distributions is a fundamental problem in statistical property testing and an active research topic in both theoretical computer science and information theory. Tight bounds on…
Estimating entropies from limited data series is known to be a non-trivial task. Naive estimations are plagued with both systematic (bias) and statistical errors. Here, we present a new 'balanced estimator' for entropy functionals Shannon,…
We conclude a sequence of work by giving near-optimal sketching and streaming algorithms for estimating Shannon entropy in the most general streaming model, with arbitrary insertions and deletions. This improves on prior results that obtain…
Data partitioning that maximizes/minimizes the Shannon entropy, or more generally the Rényi entropy, is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partition algorithms can be…
Algorithmic entropy and Shannon entropy are two conceptually different information measures, as the former is based on the size of programs and the latter on probability distributions. However, it is known that, for any recursive probability…
A new method is proposed for analyzing complexity and studying the information in random geometric networks using the Tsallis entropy. The Tsallis entropy of the ensemble of random geometric networks is calculated based on the components of…
We propose skewed stable random projections for approximating the pth frequency moments of dynamic data streams (0 < p <= 2), a problem that has been frequently studied in the theoretical computer science and database communities. Our method significantly…
Modern statistical estimation is often performed in a distributed setting where each sample belongs to a single user who shares their data with a central server. Users are typically concerned with preserving the privacy of their samples,…
The weak law of large numbers implies that, under mild assumptions on the source, the Rényi entropy per produced symbol converges (in probability) towards the Shannon entropy rate. This paper quantifies the speed of this convergence for…
We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which heavily rely on random…
Subword tokenization is a key part of many NLP pipelines. However, little is known about why some tokenizer and hyperparameter combinations lead to better downstream model performance than others. We propose that good tokenizers lead to…
This article studies the fundamental problem of using i.i.d. coin tosses from an entropy source to efficiently generate random variables $X_i \sim P_i$ $(i \ge 1)$, where $(P_1, P_2, \dots)$ is a random sequence of rational discrete…
Entropy rate of sequential data-streams naturally quantifies the complexity of the generative process. Thus entropy rate fluctuations could be used as a tool to recognize dynamical perturbations in signal sources, and could potentially be…