Related papers: Double-Hashing Algorithm for Frequency Estimation …
We present a novel approach for the problem of frequency estimation in data streams that is based on optimization and machine learning. Contrary to state-of-the-art streaming frequency estimation algorithms, which heavily rely on random…
Estimating frequencies of elements appearing in a data stream is a key task in large-scale data analysis. Popular sketching approaches to this problem (e.g., CountMin and CountSketch) come with worst-case guarantees that probabilistically…
The Hierarchical Heavy Hitters problem extends the notion of frequent items to data arranged in a hierarchy. This problem has applications to network traffic monitoring, anomaly detection, and DDoS detection. We present a new streaming…
Distributed Denial of Service (DDoS) attacks have become more prominent recently, both in frequency of occurrence, as well as magnitude. Such attacks render key Internet resources unavailable and disrupt its normal operation. It is…
We consider online mining of correlated heavy-hitters from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes…
Identifying heavy hitters and estimating the frequencies of flows are fundamental tasks in various network domains. Existing approaches to this challenge can broadly be categorized into two groups, hashing-based and competing-counter-based.…
Estimating frequencies of items over data streams is a common building block in streaming data measurement and analysis. Misra and Gries introduced their seminal algorithm for the problem in 1982, and the problem has since been revisited…
Finding heavy hitters in databases and data streams is a fundamental problem with applications ranging from network monitoring to database query optimization, machine learning, and more. Approximation algorithms offer practical solutions,…
Motivated by a recent new type of randomized Distributed Denial of Service (DDoS) attacks on the Domain Name Service (DNS), we develop novel and efficient distinct heavy hitters algorithms and build an attack identification system that uses…
Detecting heavy hitters, which are flows exceeding a specified threshold, is crucial for network measurement, but it faces challenges due to increasing throughput and memory constraints. Existing sketch-based solutions, particularly those…
In data stream applications, one of the critical issues is to estimate the frequency of each item in the specific multiset. The multiset means that each item in this set can appear multiple times. The data streams in many applications are…
Frequency estimation is one of the most fundamental problems in streaming algorithms. Given a stream $S$ of elements from some universe $U=\{1 \ldots n\}$, the goal is to compute, in a single pass, a short sketch of $S$ so that for any…
Discovering frequent episodes over event sequences is an important data mining task. In many applications, events constituting the data sequence arrive as a stream, at furious rates, and recent trends (or frequent episodes) can change and…
Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated…
To get estimators that work within a certain error bound with high probability, a common strategy is to design one that works with constant probability, and then boost the probability using independent repetitions. Important examples of…
The challenge of estimating similarity between sets has been a significant concern in data science, finding diverse applications across various domains. However, previous approaches, such as MinHash, have predominantly centered around…
Most density based stream clustering algorithms separate the clustering process into an online and offline component. Exact summarized statistics are being employed for defining micro-clusters or grid cells during the online stage followed…
When facing a very large stream of data, it is often desirable to extract most important statistics online in a short time and using small memory. For example, one may want to quickly find the most influential users generating posts online…
Streaming algorithms are fundamental in the analysis of large and online datasets. A key component of many such analytic tasks is $q$-MAX, which finds the largest $q$ values in a number stream. Modern approaches attain a constant runtime by…
Big data streams are possibly one of the most essential underlying notions. However, data streams are often challenging to handle owing to their rapid pace and limited information lifetime. It is difficult to collect and communicate stream…