Related papers: Weighted sampling without replacement
Hoeffding has shown that tail bounds on the distribution for sampling from a finite population with replacement also apply to the corresponding cases of sampling without replacement. (A special case of this result is that binomial tail…
Joining records with all other records that meet a linkage condition can result in an astronomically large number of combinations due to many-to-many relationships. For such challenging (acyclic) joins, a random sample over the join result…
Concentration inequalities quantify the deviation of a random variable from a fixed value. In spite of numerous applications, such as opinion surveys or ecological counting procedures, few concentration results are known for the setting of…
We present a new and simple approach to concentration inequalities for functions around their expectation with respect to non-product measures, i.e., for dependent random variables. Our method is based on coupling ideas and does not use…
Multistage sampling is commonly used for household surveys when there exists no sampling frame, or when the population is scattered over a wide area. Multistage sampling usually introduces a complex dependence in the selection of the final…
The concentration of measure phenomenon may be summarized as follows: a function of many weakly dependent random variables that is not too sensitive to any of its individual arguments will tend to take values very close to its expectation.…
We describe a very simple method for `consistent sampling' that allows for sampling with replacement. The method extends previous approaches to consistent sampling, which assign a pseudorandom real number to each element, and sample those…
We study concentration inequalities for structured weighted sums of random data, including (i) tensor inner products and (ii) sequential matrix sums. We are interested in tail bounds and concentration inequalities for those structured…
Ordered pivotal sampling is one of the simplest algorithm to perform without-replacement unequal probability sampling. It has found uses in the context of longitudinal surveys and spatial sampling, and enables in particular a good spatial…
Concentration bounds for non-product, non-Haar measures are fairly recent: the first such result was obtained for contracting Markov chains by Marton in 1996 via the coupling method. The work that followed, with few exceptions, also used…
In this work, we present a new random sampling method for data streams where the probability of an element's inclusion in the sample is proportional to a weight associated with that element. Our method is based on sampling with replacement,…
Starting with a set of weighted items, we want to create a generic sample of a certain size that we can later use to estimate the total weight of arbitrary subsets. For this purpose, we propose priority sampling which tested on Internet…
We investigate the low-dimensional structure of deterministic transformations between random variables, i.e., transport maps between probability measures. In the context of statistics and machine learning, these transformations can be used…
This paper addresses the problem of estimating the containment and similarity between two sets using only random samples from each set, without relying on sketches of full sets. The study introduces a binomial model for predicting the…
We establish Hoeffding-type concentration inequalities for the low and high tail bounds of sums of exchangeable random variables. Our results exhibit an anti-symmetry in such tail bounds due to the assumption of exchangeability, a…
In this work we discuss two urn models with general weight sequences $(A,B)$ associated to them, $A=(\alpha_n)_{n\in\N}$ and $B=(\beta_m)_{m\in\N}$, generalizing two well known P\'olya-Eggenberger urn models, namely the so-called sampling…
We prove that a sum of random matrices generated by a $\psi$-mixing Markov chain has similar spectral properties to a Gaussian matrix with the same mean and covariance structure. This nonasymptotic universality principle enables sharp…
Time series datasets often contain heterogeneous signals, composed of both continuously changing quantities and discretely occurring events. The coupling between these measurements may provide insights into key underlying mechanisms of the…
Weighted sampling is a fundamental tool in data analysis and machine learning pipelines. Samples are used for efficient estimation of statistics or as sparse representations of the data. When weight distributions are skewed, as is often the…
Initially motivated by the study of the non-asymptotic properties of non-parametric tests based on permutation methods, concentration inequalities for uniformly permuted sums have been largely studied in the literature. Recently, Delyon et…