Related papers: Diversity Subsampling: Custom Subsamples from Larg…
Subsampling is one of the popular methods to balance statistical efficiency and computational efficiency in the big data era. Most approaches aim at selecting informative or representative sample points to achieve good overall information…
arXiv:2206.10812v1 [stat.ME] proposes a useful algorithm, named generalized Diversity Subsampling (g-DS) algorithm, to select a subsample following some target probability distribution from a finite data set and demonstrates its…
Subsampling is a computationally efficient and scalable method to draw inference in large data settings based on a subset of the data rather than needing to consider the whole dataset. When employing subsampling techniques, a crucial…
In the time of Big Data, training complex models on large-scale data sets is challenging, making it appealing to reduce data volume for saving computation resources by subsampling. Most previous works in subsampling are weighted methods…
Due to the recent cases of algorithmic bias in data-driven decision-making, machine learning methods are being put under the microscope in order to understand the root cause of these biases and how to correct them. Here, we consider a basic…
Sampling methods that choose a subset of the data proportional to its diversity in the feature space are popular for data summarization. However, recent studies have noted the occurrence of bias (under- or over-representation of a certain…
Subsampling is commonly used to mitigate costs associated with data acquisition, such as time or energy requirements, motivating the development of algorithms for estimating the fully-sampled signal of interest $x$ from partially observed…
Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider multiple linear regression for an extraordinarily large number of…
Network datasets appear across a wide range of scientific fields, including biology, physics, and the social sciences. To enable data-driven discoveries from these networks, statistical inference techniques like estimation and hypothesis…
With the rising number of machine learning competitions, the world has witnessed an exciting race for the best algorithms. However, the involved data selection process may fundamentally suffer from evidence ambiguity and concept drift…
Subsampling is a general statistical method developed in the 1990s aimed at estimating the sampling distribution of a statistic $\hat \theta _n$ in order to conduct nonparametric inference such as the construction of confidence intervals…
For high volume data streams and large data warehouses, sampling is used for efficient approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing with a set of weighted items and want to support queries…
Faced with massive data, subsampling is a commonly used technique to improve computational efficiency, and using nonuniform subsampling probabilities is an effective approach to improve estimation efficiency. For computational efficiency,…
Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, which reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve…
Subsampling or subdata selection is a useful approach in large-scale statistical learning. Most existing studies focus on model-based subsampling methods which significantly depend on the model assumption. In this paper, we consider the…
Distributed multi-party learning provides an effective approach for training a joint model with scattered data under legal and practical constraints. However, due to the quagmire of a skewed distribution of data labels across participants…
Nonuniform subsampling methods are effective to reduce computational burden and maintain estimation efficiency for massive data. Existing methods mostly focus on subsampling with replacement due to its high computational efficiency. If the…
Huge amount of data is the key of the success of deep learning, however, redundant information impairs the generalization ability of the model and increases the burden of calculation. Dataset Distillation (DD) compresses the original…
A major challenge for building statistical models in the big data era is that the available data volume far exceeds the computational capability. A common approach for solving this problem is to employ a subsampled dataset that can be…
The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big data analysis. We propose an orthogonal…