Related papers: Sharp Frequency Bounds for Sample-Based Queries

Statistical properties of sketching algorithms

Sketching is a probabilistic data compression technique that has been largely developed in the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a…

Methodology · Statistics 2019-04-04 Daniel Ahfock, William J. Astle, Sylvia Richardson

A new Frequency Estimation Sketch for Data Streams

In data stream applications, one of the critical issues is to estimate the frequency of each item in the specific multiset. The multiset means that each item in this set can appear multiple times. The data streams in many applications are…

Data Structures and Algorithms · Computer Science 2020-01-07 Ning Li

Space Lower Bounds for Itemset Frequency Sketches

Given a database, computing the fraction of rows that contain a query itemset or determining whether this fraction is above some threshold are fundamental operations in data mining. A uniform sample of rows is a good sketch of the database…

Data Structures and Algorithms · Computer Science 2016-03-10 Edo Liberty, Michael Mitzenmacher, Justin Thaler, Jonathan Ullman

Conformal Frequency Estimation using Discrete Sketched Data with Coverage for Distinct Queries

This paper develops conformal inference methods to construct a confidence interval for the frequency of a queried object in a very large discrete data set, based on a sketch with a lower memory footprint. This approach requires no knowledge…

Methodology · Statistics 2023-08-17 Matteo Sesia, Stefano Favaro, Edgar Dobriban

Improved Frequency Estimation Algorithms with and without Predictions

Estimating frequencies of elements appearing in a data stream is a key task in large-scale data analysis. Popular sketching approaches to this problem (e.g., CountMin and CountSketch) come with worst-case guarantees that probabilistically…

Data Structures and Algorithms · Computer Science 2023-12-13 Anders Aamand, Justin Y. Chen, Huy Lê Nguyen, Sandeep Silwal, Ali Vakilian

Statistical inference for sketching algorithms

Sketching algorithms use random projections to generate a smaller sketched data set, often for the purposes of modelling. Complete and partial sketch regression estimates can be constructed using information from only the sketched data set…

Methodology · Statistics 2023-06-07 R. P. Browne, J. L. Andrews

Sampling Large Data on Graphs

We consider the problem of sampling from data defined on the nodes of a weighted graph, where the edge weights capture the data correlation structure. As shown recently, using spectral graph theory one can define a cut-off frequency for the…

Information Theory · Computer Science 2014-11-13 Ilan Shomorony, A. Salman Avestimehr

A Framework for Statistical Inference via Randomized Algorithms

Randomized algorithms, such as randomized sketching or stochastic optimization, are a promising approach to ease the computational burden in analyzing large datasets. However, randomized algorithms also produce non-deterministic outputs,…

Methodology · Statistics 2025-05-13 Zhixiang Zhang, Sokbae Lee, Edgar Dobriban

Conformal Frequency Estimation with Sketched Data

A flexible conformal inference method is developed to construct confidence intervals for the frequencies of queried objects in very large data sets, based on a much smaller sketch of those data. The approach is data-adaptive and requires no…

Methodology · Statistics 2022-11-10 Matteo Sesia, Stefano Favaro

Validation of Matching

We introduce a technique to compute probably approximately correct (PAC) bounds on precision and recall for matching algorithms. The bounds require some verified matches, but those matches may be used to develop the algorithms. The bounds…

Machine Learning · Computer Science 2016-04-12 Ya Le, Eric Bax, Nicola Barbieri, David Garcia Soriano, Jitesh Mehta, James Li

Composable Sketches for Functions of Frequencies: Beyond the Worst Case

Recently there has been increased interest in using machine learning techniques to improve classical algorithms. In this paper we study when it is possible to construct compact, composable sketches for weighted sampling and statistics…

Data Structures and Algorithms · Computer Science 2021-11-04 Edith Cohen, Ofir Geri, Rasmus Pagh

Efficient Anomaly Detection via Matrix Sketching

We consider the problem of finding anomalies in high-dimensional data using popular PCA based anomaly scores. The naive algorithms for computing these scores explicitly compute the PCA of the covariance matrix which uses space quadratic in…

Machine Learning · Computer Science 2018-11-28 Vatsal Sharan, Parikshit Gopalan, Udi Wieder

Statistical Mechanics of High-Dimensional Inference

To model modern large-scale datasets, we need efficient algorithms to infer a set of $P$ unknown model parameters from $N$ noisy measurements. What are fundamental limits on the accuracy of parameter inference, given finite signal-to-noise…

Machine Learning · Statistics 2016-09-07 Madhu Advani, Surya Ganguli

Fast Computation of Empirically Tight Bounds for the Diameter of Massive Graphs

The diameter of a graph is among its most basic parameters. Since a few years, it moreover became a key issue to compute it for massive graphs in the context of complex network analysis. However, known algorithms, including the ones…

Data Structures and Algorithms · Computer Science 2009-09-30 Clemence Magnien, Matthieu Latapy, Michel Habib

Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts

Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting…

Data Structures and Algorithms · Computer Science 2014-11-19 Madhav Jha, C. Seshadhri, Ali Pinar

Fast Concurrent Data Sketches

Data sketches are approximate succinct summaries of long streams. They are widely used for processing massive amounts of data and answering statistical queries about it in real-time. Existing libraries producing sketches are very fast, but…

Data Structures and Algorithms · Computer Science 2019-12-06 Arik Rinberg, Alexander Spiegelman, Edward Bortnikov, Eshcar Hillel, Idit Keidar, Lee Rhodes, Hadar Serviansky

Sketched Subspace Clustering

The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of ground-truth labels, is an important tool for drawing inferences from data.…

Machine Learning · Statistics 2018-02-08 Panagiotis A. Traganitis, Georgios B. Giannakis

(Learned) Frequency Estimation Algorithms under Zipfian Distribution

The frequencies of the elements in a data stream are an important statistical measure and the task of estimating them arises in many applications within data analysis and machine learning. Two of the most popular algorithms…

Data Structures and Algorithms · Computer Science 2020-08-12 Anders Aamand, Piotr Indyk, Ali Vakilian

Improving Compressed Counting

Compressed Counting (CC) [22] was recently proposed for estimating the αth frequency moments of data streams, where 0 < α <= 2. CC can be used for estimating Shannon entropy, which can be approximated by certain functions of the αth…

Data Structures and Algorithms · Computer Science 2012-05-14 Ping Li

Randomized Spectral Clustering in Large-Scale Stochastic Block Models

Spectral clustering has been one of the widely used methods for community detection in networks. However, large-scale networks bring computational challenges to the eigenvalue decomposition therein. In this paper, we study the spectral…

Social and Information Networks · Computer Science 2022-01-07 Hai Zhang, Xiao Guo, Xiangyu Chang