Related papers: Sampling with Costs
There is a growing body of work on sorting and selection in models other than the unit-cost comparison model. This work is the first treatment of a natural stochastic variant of the problem where the cost of comparing two elements is a…
We compute the integral of a function or the expectation of a random variable with minimal cost and use, for our new algorithm and for upper bounds of the complexity, i.i.d. samples. Under certain assumptions it is possible to select a…
We study the problem of efficiently estimating counts for queries involving complex filters, such as user-defined functions, or predicates involving self-joins and correlated subqueries. For such queries, traditional sampling techniques may…
The basic idea of importance sampling is to use independent samples from a proposal measure in order to approximate expectations with respect to a target measure. It is key to understand how many samples are required in order to guarantee…
Starting with a set of weighted items, we want to create a generic sample of a certain size that we can later use to estimate the total weight of arbitrary subsets. For this purpose, we propose priority sampling which tested on Internet…
Statistical samples, in order to be representative, have to be drawn from a population in a random and unbiased way. Nevertheless, it is common practice in the field of model-based diagnosis to make estimations from (biased) best-first…
Many applications require the collection of data on different variables or measurements over many system performance metrics. We term those broadly as measures or variables. Often data collection along each measure incurs a cost, thus it is…
Crowdsourcing provides a popular paradigm for data collection at scale. We study the problem of selecting subsets of workers from a given worker pool to maximize the accuracy under a budget constraint. One natural question is whether we…
We consider the problem of deciding on sampling strategy, in particular sampling design. We propose a risk measure, whose minimizing value guides the choice. The method makes use of a superpopulation model and takes into account uncertainty…
We consider a model where an agent has a repeated decision to make and wishes to maximize their total payoff. Payoffs are influenced by an action taken by the agent, but also an unknown state of the world that evolves over time. Before…
Imbalanced data is a frequently encountered problem in machine learning. Despite a vast amount of literature on sampling techniques for imbalanced data, there is a limited number of studies that address the issue of the optimal sampling…
Running machine learning algorithms on large and rapidly growing volumes of data is often computationally expensive, one common trick to reduce the size of a data set, and thus reduce the computational cost of machine learning algorithms,…
Estimating the prevalence of a disease is necessary for evaluating and mitigating risks of its transmission within or between populations. Estimates that consider how prevalence changes with time provide more information about these risks…
For high volume data streams and large data warehouses, sampling is used for efficient approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing with a set of weighted items and want to support queries…
For many tasks of data analysis, we may only have the information of the explanatory variable and the evaluation of the response values are quite expensive. While it is impractical or too costly to obtain the responses of all units, a…
The note studies the problem of selecting a good enough subset out of a finite number of alternatives under a fixed simulation budget. Our work aims to maximize the posterior probability of correctly selecting a good subset. We formulate…
The random cost problem is the problem of finding the minimum in an exponentially long list of random numbers. By definition, this problem cannot be solved faster than by exhaustive search. It is shown that a classical NP-hard optimization…
We study sequential experiments where sampling is costly and a decision-maker aims to determine the best treatment for full scale implementation by (1) adaptively allocating units between two possible treatments, and (2) stopping the…
Almost every software system provides configuration options to tailor the system to the target platform and application scenario. Often, this configurability renders the analysis of every individual system configuration infeasible. To…
Adequate sampling is essential for the well-functioning of a market surveillance system. As small as possible statistically significant sample size is the main factor that determines the costs of market surveillance actions. This paper…