Related papers: Diversity Subsampling: Custom Subsamples from Larg…

200 papers

Subsampling is one of the popular methods to balance statistical efficiency and computational efficiency in the big data era. Most approaches aim at selecting informative or representative sample points to achieve good overall information…

Methodology · Statistics 2024-07-10 Haolin Chen, Holger Dette, Jun Yu

arXiv:2206.10812v1 [stat.ME] proposes a useful algorithm, named generalized Diversity Subsampling (g-DS) algorithm, to select a subsample following some target probability distribution from a finite data set and demonstrates its…

Methodology · Statistics 2023-09-06 Boyang Shang

Subsampling is a computationally efficient and scalable method to draw inference in large data settings based on a subset of the data rather than needing to consider the whole dataset. When employing subsampling techniques, a crucial…

Methodology · Statistics 2025-10-08 Amalan Mahendran, Helen Thompson, James M. McGree

In the time of Big Data, training complex models on large-scale data sets is challenging, making it appealing to reduce data volume for saving computation resources by subsampling. Most previous works in subsampling are weighted methods…

Machine Learning · Computer Science 2021-04-14 Zifeng Wang, Hong Zhu, Zhenhua Dong, Xiuqiang He, Shao-Lun Huang

Due to the recent cases of algorithmic bias in data-driven decision-making, machine learning methods are being put under the microscope in order to understand the root cause of these biases and how to correct them. Here, we consider a basic…

Machine Learning · Computer Science 2016-10-25 L. Elisa Celis, Amit Deshpande, Tarun Kathuria, Nisheeth K. Vishnoi

Sampling methods that choose a subset of the data proportional to its diversity in the feature space are popular for data summarization. However, recent studies have noted the occurrence of bias (under- or over-representation of a certain…

Machine Learning · Computer Science 2018-02-13 L. Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, Nisheeth K. Vishnoi

Subsampling is commonly used to mitigate costs associated with data acquisition, such as time or energy requirements, motivating the development of algorithms for estimating the fully-sampled signal of interest $x$ from partially observed…

Machine Learning · Computer Science 2025-04-23 Oisin Nolan, Tristan S. W. Stevens, Wessel L. van Nierop, Ruud J. G. van Sloun

Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider multiple linear regression for an extraordinarily large number of…

Methodology · Statistics 2025-05-30 Torsten Glemser, Rainer Schwabe

Network datasets appear across a wide range of scientific fields, including biology, physics, and the social sciences. To enable data-driven discoveries from these networks, statistical inference techniques like estimation and hypothesis…

Methodology · Statistics 2026-02-19 Arpan Kumar, Minh Tang, Srijan Sengupta

With the rising number of machine learning competitions, the world has witnessed an exciting race for the best algorithms. However, the involved data selection process may fundamentally suffer from evidence ambiguity and concept drift…

Machine Learning · Computer Science 2020-06-15 Hoang D. Nguyen, Xuan-Son Vu, Quoc-Tuan Truong, Duc-Trong Le

Subsampling is a general statistical method developed in the 1990s aimed at estimating the sampling distribution of a statistic $\hat \theta _n$ in order to conduct nonparametric inference such as the construction of confidence intervals…

Statistics Theory · Mathematics 2021-12-14 Dimitris N. Politis

For high volume data streams and large data warehouses, sampling is used for efficient approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing with a set of weighted items and want to support queries…

Data Structures and Algorithms · Computer Science 2007-05-23 Mario Szegedy, Mikkel Thorup

Faced with massive data, subsampling is a commonly used technique to improve computational efficiency, and using nonuniform subsampling probabilities is an effective approach to improve estimation efficiency. For computational efficiency,…

Statistics Theory · Mathematics 2022-05-19 Jing Wang, Jiahui Zou, HaiYing Wang

Semi-Supervised Learning (SSL) has become a preferred paradigm in many deep learning tasks, which reduces the need for human labor. Previous studies primarily focus on effectively utilising the labelled and unlabeled data to improve…

Machine Learning · Computer Science 2024-10-29 Qian Shao, Jiangrui Kang, Qiyuan Chen, Zepeng Li, Hongxia Xu, Yiwen Cao, Jiajuan Liang, Jian Wu

Subsampling or subdata selection is a useful approach in large-scale statistical learning. Most existing studies focus on model-based subsampling methods which significantly depend on the model assumption. In this paper, we consider the…

Methodology · Statistics 2022-09-09 Mei Zhang, Yongdao Zhou, Zheng Zhou, Aijun Zhang

Distributed multi-party learning provides an effective approach for training a joint model with scattered data under legal and practical constraints. However, due to the quagmire of a skewed distribution of data labels across participants…

Machine Learning · Computer Science 2021-11-01 Maoguo Gong, Yuan Gao, Yue Wu, A. K. Qin

Nonuniform subsampling methods are effective to reduce computational burden and maintain estimation efficiency for massive data. Existing methods mostly focus on subsampling with replacement due to its high computational efficiency. If the…

Methodology · Statistics 2021-07-06 Jun Yu, HaiYing Wang, Mingyao Ai, Huiming Zhang

Huge amount of data is the key of the success of deep learning, however, redundant information impairs the generalization ability of the model and increases the burden of calculation. Dataset Distillation (DD) compresses the original…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Hangyang Kong, Wenbo Zhou, Xuxiang He, Xiaotong Tu, Xinghao Ding

A major challenge for building statistical models in the big data era is that the available data volume far exceeds the computational capability. A common approach for solving this problem is to employ a subsampled dataset that can be…

Computation · Statistics 2018-09-14 Lei Han, Kean Ming Tan, Ting Yang, Tong Zhang

The dramatic growth of big datasets presents a new challenge to data storage and analysis. Data reduction, or subsampling, that extracts useful information from datasets is a crucial step in big data analysis. We propose an orthogonal…

Methodology · Statistics 2021-06-01 Lin Wang, Jake Elmstedt, Weng Kee Wong, Hongquan Xu