Related papers: A Sampling Algebra for Aggregate Estimation
An energy efficient use of large scale sensor networks necessitates activating a subset of possible sensors for estimation at a fusion center. The problem is inherently combinatorial; to this end, a set of iterative, randomized algorithms…
Surrogate models based on machine learning methods have become an important part of modern engineering to replace costly computer simulations. The data used for creating a surrogate model are essential for the model accuracy and often…
Distributed data aggregation is an important task, allowing the decentralized determination of meaningful global properties, that can then be used to direct the execution of other applications. The resulting values result from the…
For high volume data streams and large data warehouses, sampling is used for efficient approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing with a set of weighted items and want to support queries…
Statistical analysis of voluntary survey data is an important area of research in survey sampling. We consider a unified approach to voluntary survey data analysis under the assumption that the sampling mechanism is ignorable. Generalized…
Nonparametric estimation using uniform-width binning is a standard approach for evaluating the calibration performance of machine learning models. However, existing theoretical analyses of the bias induced by binning are limited to binary…
In order to scale standard Gaussian process (GP) regression to large-scale datasets, aggregation models employ factorized training process and then combine predictions from distributed experts. The state-of-the-art aggregation models,…
Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that samples uniformly from the set of entities present in the database in the presence of…
For many applications, an ensemble of base classifiers is an effective solution. The tuning of its parameters(number of classes, amount of data on which each classifier is to be trained on, etc.) requires G, the generalization error of a…
A new, very general, robust procedure for combining estimators in metric spaces is introduced GROS. The method is reminiscent of the well-known median of means, as described in \cite{devroye2016sub}. Initially, the sample is divided into…
The need for accurate SQL progress estimation in the context of decision support administration has led to a number of techniques proposed for this task. Unfortunately, no single one of these progress estimators behaves robustly across the…
In this work, we analyze an efficient sampling-based algorithm for general-purpose reachability analysis, which remains a notoriously challenging problem with applications ranging from neural network verification to safety analysis of…
Identifying the underlying models in a set of data points contaminated by noise and outliers, leads to a highly complex multi-model fitting problem. This problem can be posed as a clustering problem by the projection of higher order…
Sampling-based algorithms are classical approaches to perform Bayesian inference in inverse problems. They provide estimators with the associated credibility intervals to quantify the uncertainty on the estimators. Although these methods…
Thompson sampling has emerged as an effective heuristic for a broad range of online decision problems. In its basic form, the algorithm requires computing and sampling from a posterior distribution over models, which is tractable only for…
We investigate the data distribution valuation problem, which aims to quantify the values of data distributions from their samples. This is a recently proposed problem that is related to but different from classical data valuation and can…
Gaussian process regression is a powerful Bayesian nonlinear regression method. Recent research has enabled the capture of many types of observations using non-Gaussian likelihoods. To deal with various tasks in spatial modeling, we benefit…
This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis with memory constraints of computers. Specifically, for the whole dataset with size $N$, $m_N$ subsamples are randomly drawn, and each…
Random sampling has become a critical tool in solving massive matrix problems. For linear regression, a small, manageable set of data rows can be randomly selected to approximate a tall, skinny data matrix, improving processing time…
Uncertainty sampling is a prevalent active learning algorithm that queries sequentially the annotations of data samples which the current prediction model is uncertain about. However, the usage of uncertainty sampling has been largely…