Related papers: Fast Concurrent Data Sketches
Sketches are a family of streaming algorithms widely used in the world of big data to perform fast, real-time analytics. A popular sketch type is Quantiles, which estimates the data distribution of a large input stream. We present…
In data stream applications, one of the critical issues is to estimate the frequency of each item in the specific multiset. The multiset means that each item in this set can appear multiple times. The data streams in many applications are…
We introduce Density sketches (DS): a succinct online summary of the data distribution. DS can accurately estimate point wise probability density. Interestingly, DS also provides a capability to sample unseen novel data from the underlying…
Summary statistics such as the mean and variance are easily maintained for large, distributed data streams, but order statistics (i.e., sample quantiles) can only be approximately summarized. There is extensive literature on maintaining…
A sketch is a probabilistic data structure used to record frequencies of items in a multi-set. Sketches are widely used in various fields, especially those that involve processing and storing data streams. In streaming applications with…
A key need in different disciplines is to perform analytics over fast-paced data streams, similar in nature to the traditional OLAP analytics in relational databases i.e., with filters and aggregates. Storing unbounded streams, however, is…
Linear sketching algorithms have been widely used for processing large-scale distributed and streaming datasets. Their popularity is largely due to the fact that linear sketches can be naturally composed in the distributed model and be…
Sketches are commonly used in computer systems and network monitoring tools to provide efficient query executions while maintaining a compact data representation. Switches and routers maintain sketches to track statistical characteristics…
Sketches are probabilistic data structures that can provide approximate results within mathematically proven error bounds while using orders of magnitude less memory than traditional approaches. They are tailored for streaming data analysis…
A graph stream is a continuous sequence of data items, in which each item indicates an edge, including its two endpoints and edge weight. It forms a dynamic graph that changes with every item in the stream. Graph streams play important…
Streaming analytics are essential in a large range of applications, including databases, networking, and machine learning. To optimize performance, practitioners are increasingly offloading such analytics to network nodes such as switches.…
Large, distributed data streams are now ubiquitous. High-accuracy sketches with low memory overhead have become the de facto method for analyzing this data. For instance, if we wish to group data by some label and report the largest counts…
The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of ground-truth labels, is an important tool for drawing inferences from data.…
The vast amounts of data used in social, business or traffic networks, biology and other natural sciences are often managed in graph-based data sets, consisting of a few thousand up to billions and trillions of vertices and edges,…
Parallel batched data structures are designed to process synchronized batches of operations in a parallel computing model. In this paper, we propose parallel combining, a technique that implements a concurrent data structure from a parallel…
In this paper, we consider the problem of estimating the distance between any two large data streams in small- space constraint. This problem is of utmost importance in data intensive monitoring applications where input streams are…
The challenge of estimating similarity between sets has been a significant concern in data science, finding diverse applications across various domains. However, previous approaches, such as MinHash, have predominantly centered around…
To approximate sums of values in key-value data streams, sketches are widely used in databases and networking systems. They offer high-confidence approximations for any given key while ensuring low time and space overhead. While existing…
Network stream mining is fundamental to many network operations. Sketches, as compact data structures that offer low memory overhead with bounded accuracy, have emerged as a promising solution for network stream mining. Recent studies…
Data sketching is a critical tool for distinct counting, enabling multisets to be represented by compact summaries that admit fast cardinality estimates. Because sketches may be merged to summarize multiset unions, they are a basic building…