Related papers: SALSA: Self-Adjusting Lean Streaming Analytics
Modern stream processing systems often need to track the frequency of distinct keys in a data stream in real-time. Since maintaining exact counts can require a prohibitive amount of memory, many applications rely on compact, probabilistic…
Due to the large data volume and number of distinct elements, space is often the bottleneck of many stream processing systems. The data structures used by these systems often consist of counters whose optimization yields significant memory…
Matrix sketching is a recently developed data compression technique. An input matrix A is efficiently approximated with a smaller matrix B, so that B preserves most of the properties of A up to some guaranteed approximation ratio. In so…
With the exponentially growing Internet traffic, sketch data structure with a probabilistic algorithm has been expected to be an alternative solution for non-compromised (non-selective) security monitoring. While facilitating counting…
The challenge of estimating similarity between sets has been a significant concern in data science, finding diverse applications across various domains. However, previous approaches, such as MinHash, have predominantly centered around…
Streaming analytics are essential in a large range of applications, including databases, networking, and machine learning. To optimize performance, practitioners are increasingly offloading such analytics to network nodes such as switches.…
Sketch-and-solve (SAS) is a very successful method to efficiently estimate the solution of heavily overdetermined large linear least squares problems. It uses random sketching to reduce the size of the problem, hence reducing the…
Large, distributed data streams are now ubiquitous. High-accuracy sketches with low memory overhead have become the de facto method for analyzing this data. For instance, if we wish to group data by some label and report the largest counts…
Measuring network flow sizes is important for tasks like accounting/billing, network forensics and security. Per-flow accounting is considered hard because it requires that many counters be updated at a very high speed; however, the large…
MinHash and HyperLogLog are sketching algorithms that have become indispensable for set summaries in big data applications. While HyperLogLog allows counting different elements with very little space, MinHash is suitable for the fast…
Matrix sketching is a powerful tool for reducing the size of large data matrices. Yet there are fundamental limitations to this size reduction when we want to recover an accurate estimator for a task such as least square regression. We show…
Demands are increasing to measure per-flow statistics in the data plane of high-speed switches. Measuring flows with exact counting is infeasible due to processing and memory constraints, but a sketch is a promising candidate for collecting…
We propose OverSketch, an approximate algorithm for distributed matrix multiplication in serverless computing. OverSketch leverages ideas from matrix sketching and high-performance computing to enable cost-efficient multiplication that is…
Graph streams are rapidly evolving sequences of edges that convey continuously changing relationships among entities, playing a crucial role in domains such as networking, finance, and cybersecurity. Their massive scale and high dynamism…
Data sketches are approximate succinct summaries of long streams. They are widely used for processing massive amounts of data and answering statistical queries about it in real-time. Existing libraries producing sketches are very fast, but…
Count-Min Sketch is a widely adopted algorithm for approximate event counting in large scale processing. However, the original version of the Count-Min-Sketch (CMS) suffers of some deficiences, especially if one is interested by the…
Many tasks in machine learning and data mining, such as data diversification, non-parametric learning, kernel machines, clustering etc., require extracting a small but representative summary from a massive dataset. Often, such problems can…
This survey highlights the recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, whereby given a matrix, one first compresses it to a much smaller matrix by multiplying it by a…
Sketching is a dimensionality reduction technique where one compresses a matrix by linear combinations that are chosen at random. A line of work has shown how to sketch the Hessian to speed up each iteration in a second order method, but…
Identifying independence between two random variables or correlated given their samples has been a fundamental problem in Statistics. However, how to do so in a space-efficient way if the number of states is large is not quite well-studied.…