Related papers: Sampling Space-Saving Set Sketches
Data sketching is a critical tool for distinct counting, enabling multisets to be represented by compact summaries that admit fast cardinality estimates. Because sketches may be merged to summarize multiset unions, they are a basic building…
Stream monitoring is fundamental in many data stream applications, such as financial data trackers, security, anomaly detection, and load balancing. In that respect, quantiles are of particular interest, as they often capture the user's…
In rapid and massive data streams, it is often not possible to estimate the frequency of items with complete accuracy. To perform the operation in a reasonable amount of space and with sufficiently low latency, approximated methods are…
We introduce and study a new data sketch for processing massive datasets. It addresses two common problems: 1) computing a sum given arbitrary filter conditions and 2) identifying the frequent items or heavy hitters in a data set. For the…
The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of ground-truth labels, is an important tool for drawing inferences from data.…
Estimating the number of distinct elements in a data stream is well understood when repeated elements are identical. In modern settings, however, observations are high-dimensional and noisy, so repeated instances of the same object are only…
The challenge of estimating similarity between sets has been a significant concern in data science, finding diverse applications across various domains. However, previous approaches, such as MinHash, have predominantly centered around…
We introduce a new sub-linear space sketch---the Weight-Median Sketch---for learning compressed linear classifiers over data streams while supporting the efficient recovery of large-magnitude weights in the model. This enables…
The sliding window model of computation captures scenarios in which data are continually arriving in the form of a stream, and only the most recent $w$ items are used for analysis. In this setting, an algorithm needs to accurately track…
While traditional data-management systems focus on evaluating single, ad-hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is…
Modern data stream applications demand memory-efficient solutions for accurately tracking frequent items, such as heavy hitters and heavy changers, under strict resource constraints. Traditional sketches face inherent accuracy-memory…
Sketch-based streaming algorithms allow efficient processing of big data. These algorithms use small fixed-size storage to store a summary ("sketch") of the input data, and use probabilistic algorithms to estimate the desired quantity.…
Data sketches are approximate succinct summaries of long streams. They are widely used for processing massive amounts of data and answering statistical queries about it in real-time. Existing libraries producing sketches are very fast, but…
Estimating cardinality, i.e., the number of distinct elements, of a data stream is a fundamental problem in areas like databases, computer networks, and information retrieval. This study delves into a broader scenario where each element…
Structured high-cardinality data arises in many domains, and poses a major challenge for both modeling and inference. Graphical models are a popular approach to modeling structured data but they are unsuitable for high-cardinality…
The unsupervised learning of community structure, in particular the partitioning vertices into clusters or communities, is a canonical and well-studied problem in exploratory graph analysis. However, like most graph analyses the…
Graph streams represent data interactions in real applications. The mining of graph streams plays an important role in network security, social network analysis, and traffic control, among others. However, the sheer volume and high dynamics…
Document sketching using Jaccard similarity has been a workable effective technique in reducing near-duplicates in Web page and image search results, and has also proven useful in file system synchronization, compression and learning…
In this paper, we consider the problem of estimating the distance between any two large data streams in small- space constraint. This problem is of utmost importance in data intensive monitoring applications where input streams are…
In data stream applications, one of the critical issues is to estimate the frequency of each item in the specific multiset. The multiset means that each item in this set can appear multiple times. The data streams in many applications are…