Related papers: Diversity Subsampling: Custom Subsamples from Larg…

Multi-resolution subsampling for large-scale linear classification

Subsampling is one of the popular methods to balance statistical efficiency and computational efficiency in the big data era. Most approaches aim at selecting informative or representative sample points to achieve good overall information…

Methodology · Statistics 2024-07-10 Haolin Chen, Holger Dette, Jun Yu

On Convergence Rate of the Generalized Diversity Subsampling Method

arXiv:2206.10812v1 [stat.ME] proposes a useful algorithm, named generalized Diversity Subsampling (g-DS) algorithm, to select a subsample following some target probability distribution from a finite data set and demonstrates its…

Methodology · Statistics 2023-09-06 Boyang Shang

A subsampling approach for large data sets when the Generalised Linear Model is potentially misspecified

Subsampling is a computationally efficient and scalable method to draw inference in large data settings based on a subset of the data rather than needing to consider the whole dataset. When employing subsampling techniques, a crucial…

Methodology · Statistics 2025-10-08 Amalan Mahendran, Helen Thompson, James M. McGree

Less Is Better: Unweighted Data Subsampling via Influence Function

In the time of Big Data, training complex models on large-scale data sets is challenging, making it appealing to reduce data volume for saving computation resources by subsampling. Most previous works in subsampling are weighted methods…

Machine Learning · Computer Science 2021-04-14 Zifeng Wang, Hong Zhu, Zhenhua Dong, Xiuqiang He, Shao-Lun Huang

How to be Fair and Diverse?

Due to the recent cases of algorithmic bias in data-driven decision-making, machine learning methods are being put under the microscope in order to understand the root cause of these biases and how to correct them. Here, we consider a basic…

Machine Learning · Computer Science 2016-10-25 L. Elisa Celis, Amit Deshpande, Tarun Kathuria, Nisheeth K. Vishnoi

Fair and Diverse DPP-based Data Summarization

Sampling methods that choose a subset of the data proportional to its diversity in the feature space are popular for data summarization. However, recent studies have noted the occurrence of bias (under- or over-representation of a certain…

Machine Learning · Computer Science 2018-02-13 L. Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, Nisheeth K. Vishnoi

Active Diffusion Subsampling

Subsampling is commonly used to mitigate costs associated with data acquisition, such as time or energy requirements, motivating the development of algorithms for estimating the fully-sampled signal of interest $x$ from partially observed…

Machine Learning · Computer Science 2025-04-23 Oisin Nolan, Tristan S. W. Stevens, Wessel L. van Nierop, Ruud J. G. van Sloun

D-optimal Subsampling Design for Massive Data Linear Regression

Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider multiple linear regression for an extraordinarily large number of…

Methodology · Statistics 2025-05-30 Torsten Glemser, Rainer Schwabe

Predictive Subsampling for Scalable Inference in Networks

Network datasets appear across a wide range of scientific fields, including biology, physics, and the social sciences. To enable data-driven discoveries from these networks, statistical inference techniques like estimation and hypothesis…

Methodology · Statistics 2026-02-19 Arpan Kumar, Minh Tang, Srijan Sengupta

Reinforced Data Sampling for Model Diversification

With the rising number of machine learning competitions, the world has witnessed an exciting race for the best algorithms. However, the involved data selection process may fundamentally suffer from evidence ambiguity and concept drift…

Machine Learning · Computer Science 2020-06-15 Hoang D. Nguyen, Xuan-Son Vu, Quoc-Tuan Truong, Duc-Trong Le

Scalable subsampling: computation, aggregation and inference

Subsampling is a general statistical method developed in the 1990s aimed at estimating the sampling distribution of a statistic $\hat \theta _n$ in order to conduct nonparametric inference such as the construction of confidence intervals…

Statistics Theory · Mathematics 2021-12-14 Dimitris N. Politis

On the variance of subset sum estimation

For high volume data streams and large data warehouses, sampling is used for efficient approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing with a set of weighted items and want to support queries…

Data Structures and Algorithms · Computer Science 2007-05-23 Mario Szegedy, Mikkel Thorup

Sampling with replacement vs Poisson sampling: a comparative study in optimal subsampling

Faced with massive data, subsampling is a commonly used technique to improve computational efficiency, and using nonuniform subsampling probabilities is an effective approach to improve estimation efficiency. For computational efficiency,…

Statistics Theory · Mathematics 2022-05-19 Jing Wang, Jiahui Zou, HaiYing Wang

Enhancing Semi-Supervised Learning via Representative and Diverse Sample Selection

Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, which reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve…

Machine Learning · Computer Science 2024-10-29 Qian Shao, Jiangrui Kang, Qiyuan Chen, Zepeng Li, Hongxia Xu, Yiwen Cao, Jiajuan Liang, Jian Wu

Model-free Subsampling Method Based on Uniform Designs

Subsampling or subdata selection is a useful approach in large-scale statistical learning. Most existing studies focus on model-based subsampling methods which significantly depend on the model assumption. In this paper, we consider the…

Methodology · Statistics 2022-09-09 Mei Zhang, Yongdao Zhou, Zheng Zhou, Aijun Zhang

ADDS: Adaptive Differentiable Sampling for Robust Multi-Party Learning

Distributed multi-party learning provides an effective approach for training a joint model with scattered data under legal and practical constraints. However, due to the quagmire of a skewed distribution of data labels across participants…

Machine Learning · Computer Science 2021-11-01 Maoguo Gong, Yuan Gao, Yue Wu, A. K. Qin

Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators with Massive Data

Nonuniform subsampling methods are effective to reduce computational burden and maintain estimation efficiency for massive data. Existing methods mostly focus on subsampling with replacement due to its high computational efficiency. If the…

Methodology · Statistics 2021-07-06 Jun Yu, HaiYing Wang, Mingyao Ai, Huiming Zhang

Efficient Dataset Distillation through Low-Rank Space Sampling

Huge amount of data is the key of the success of deep learning, however, redundant information impairs the generalization ability of the model and increases the burden of calculation. Dataset Distillation (DD) compresses the original…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Hangyang Kong, Wenbo Zhou, Xuxiang He, Xiaotong Tu, Xinghao Ding

Local Uncertainty Sampling for Large-Scale Multi-Class Logistic Regression

A major challenge for building statistical models in the big data era is that the available data volume far exceeds the computational capability. A common approach for solving this problem is to employ a subsampled dataset that can be…

Computation · Statistics 2018-09-14 Lei Han, Kean Ming Tan, Ting Yang, Tong Zhang

Orthogonal Subsampling for Big Data Linear Regression

The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big data analysis. We propose an orthogonal…

Methodology · Statistics 2021-06-01 Lin Wang, Jake Elmstedt, Weng Kee Wong, Hongquan Xu