Related papers: Weighted Sampling Without Replacement from Data St…
In this work, we present a new random sampling method for data streams where the probability of an element's inclusion in the sample is proportional to a weight associated with that element. Our method is based on sampling with replacement,…
In this work, we present a comprehensive treatment of weighted random sampling (WRS) over data streams. More precisely, we examine two natural interpretations of the item weights, describe an existing algorithm for each case ([2, 4]),…
Graph Sampling provides an efficient yet inexpensive solution for analyzing large graphs. While extracting small representative subgraphs from large graphs, the challenge is to capture the properties of the original graph. Several sampling…
This paper presents a novel algorithm solving the classic problem of generating a random sample of size s from a population of size n with non-uniform probabilities. The sampling is done with replacement. The algorithm requires constant…
Weighted sampling is a fundamental tool in data analysis and machine learning pipelines. Samples are used for efficient estimation of statistics or as sparse representations of the data. When weight distributions are skewed, as is often the…
As the popularity of graph data increases, there is a growing need to count the occurrences of subgraph patterns of interest, for a variety of applications. Many graphs are massive in scale and also fully dynamic (with insertions and…
We consider message-efficient continuous random sampling from a distributed stream, where the probability of inclusion of an item in the sample is proportional to a weight associated with the item. The unweighted version, where all weights…
A technique introduced by Indyk and Woodruff [STOC 2005] has inspired several recent advances in data-stream algorithms. We show that a number of these results follow easily from the application of a single probabilistic method called…
Most computational models of dependency syntax consist of distributions over spanning trees. However, the majority of dependency treebanks require that every valid dependency tree has a single edge coming out of the ROOT node, a constraint…
Starting with a set of weighted items, we want to create a generic sample of a certain size that we can later use to estimate the total weight of arbitrary subsets. For this purpose, we propose priority sampling, which, tested on Internet…
Sampling is a fundamental technique, and sampling without replacement is often desirable when duplicate samples are not beneficial. Within machine learning, sampling is useful for generating diverse outputs from a trained model. We present…
From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size $k$ that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of on-line reservoir…
Consider the fundamental problem of drawing a simple random sample of size k without replacement from [n] := {1, ..., n}. Although a number of classical algorithms exist for this problem, we construct algorithms that are even simpler,…
We present the first feasible method for sampling a dynamic data stream with deletions, where the sample consists of pairs $(k,C_k)$ of a value $k$ and its exact total count $C_k$. Our algorithms are for both Strict Turnstile data streams…
Suppose an $n \times d$ design matrix in a linear regression problem is given, but the response for each point is hidden unless explicitly requested. The goal is to sample only a small number $k \ll n$ of the responses, and then produce a…
We analyze the convergence rates of stochastic gradient algorithms for smooth finite-sum minimax optimization and show that, for many such algorithms, sampling the data points without replacement leads to faster convergence compared to…
We consider communication-efficient weighted and unweighted (uniform) random sampling from distributed data streams presented as a sequence of mini-batches of items. This is a natural model for distributed streaming computation, and our…
To tackle massive data, subsampling is a practical approach to select the more informative data points. However, when responses are expensive to measure, developing efficient subsampling schemes is challenging, and an optimal sampling…
Sampling without replacement is a natural online rounding strategy for converting fractional bipartite matching into an integral one. In Online Bipartite Matching, we can use the Balance algorithm to fractionally match each online vertex,…
The combined algorithm selection and hyperparameter tuning (CASH) problem is characterized by large hierarchical hyperparameter spaces. Model-free hyperparameter tuning methods can explore such large spaces efficiently since they are highly…