Related papers: Subsampling Suffices for Adaptive Data Analysis

Algorithmic Stability for Adaptive Data Analysis

Adaptivity is an important feature of data analysis: the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model,…

Machine Learning · Computer Science 2015-11-10 Raef Bassily, Kobbi Nissim, Adam Smith, Thomas Steinke, Uri Stemmer, Jonathan Ullman

More General Queries and Less Generalization Error in Adaptive Data Analysis

Adaptivity is an important feature of data analysis: typically the choice of questions asked about a dataset depends on previous interactions with the same dataset. However, generalization error is typically bounded in a non-adaptive…

Machine Learning · Computer Science 2015-11-11 Raef Bassily, Adam Smith, Thomas Steinke, Jonathan Ullman

Adaptive Threshold Sampling

Sampling is a fundamental problem in computer science and statistics. However, for a given task and stream, it is often not possible to choose good sampling probabilities in advance. We derive a general framework for adaptively changing the…

Machine Learning · Statistics 2022-06-16 Daniel Ting

Making Progress Based on False Discoveries

The study of adaptive data analysis examines how many statistical queries can be answered accurately using a fixed dataset while avoiding false discoveries (statistically inaccurate answers). In this paper, we tackle a question that…

Machine Learning · Computer Science 2023-02-09 Roi Livni

Adaptive Data Analysis for Growing Data

Reuse of data in adaptive workflows poses challenges regarding overfitting and the statistical validity of results. Previous work has demonstrated that interacting with data via differentially private algorithms can mitigate overfitting,…

Machine Learning · Computer Science 2025-11-13 Neil G. Marchant, Benjamin I. P. Rubinstein

Tackling the subsampling problem to infer collective properties from limited data

Complex systems are fascinating because their rich macroscopic properties emerge from the interaction of many simple parts. Understanding the building principles of these emergent phenomena in nature requires assessing natural complex…

Neurons and Cognition · Quantitative Biology 2022-11-17 Anna Levina, Viola Priesemann, Johannes Zierenberg

Sampling Without Compromising Accuracy in Adaptive Data Analysis

In this work, we study how to use sampling to speed up mechanisms for answering adaptive queries into datasets without reducing the accuracy of those mechanisms. This is important to do when both the datasets and the number of queries asked…

Machine Learning · Computer Science 2020-01-03 Benjamin Fish, Lev Reyzin, Benjamin I. P. Rubinstein

Predictive Subsampling for Scalable Inference in Networks

Network datasets appear across a wide range of scientific fields, including biology, physics, and the social sciences. To enable data-driven discoveries from these networks, statistical inference techniques like estimation and hypothesis…

Methodology · Statistics 2026-02-19 Arpan Kumar, Minh Tang, Srijan Sengupta

Non-Adaptive Adaptive Sampling on Turnstile Streams

Adaptive sampling is a useful algorithmic tool for data summarization problems in the classical centralized setting, where the entire dataset is available to the single processor performing the computation. Adaptive sampling repeatedly…

Data Structures and Algorithms · Computer Science 2020-04-24 Sepideh Mahabadi, Ilya Razenshteyn, David P. Woodruff, Samson Zhou

Tight Bounds for Answering Adaptively Chosen Concentrated Queries

Most work on adaptive data analysis assumes that samples in the dataset are independent. When correlations are allowed, even the non-adaptive setting can become intractable, unless some structural constraints are imposed. To address this,…

Data Structures and Algorithms · Computer Science 2025-11-13 Emma Rapoport, Edith Cohen, Uri Stemmer

Challenges in Bayesian Adaptive Data Analysis

Traditional statistical analysis requires that the analysis process and data are independent. By contrast, the new field of adaptive data analysis hopes to understand and provide algorithms and accuracy guarantees for research as it is…

Machine Learning · Computer Science 2017-03-22 Sam Elder

Subset Sampling and Its Extensions

This paper studies the subset sampling problem. The input is a set $\mathcal{S}$ of $n$ records together with a function $\textbf{p}$ that assigns each record $v\in\mathcal{S}$ a probability $\textbf{p}(v)$. A query returns a random…

Data Structures and Algorithms · Computer Science 2023-07-24 Jinchao Huang, Sibo Wang

A subsampling approach for large data sets when the Generalised Linear Model is potentially misspecified

Subsampling is a computationally efficient and scalable method to draw inference in large data settings based on a subset of the data rather than needing to consider the whole dataset. When employing subsampling techniques, a crucial…

Methodology · Statistics 2025-10-08 Amalan Mahendran, Helen Thompson, James M. McGree

Generalization in Adaptive Data Analysis and Holdout Reuse

Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding this problem focus on statistical inference and generalization of individual analysis procedures. Yet the practice of data analysis…

Machine Learning · Computer Science 2015-09-28 Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth

Subsampling for General Statistics under Long Range Dependence with application to change point analysis

In the statistical inference for long range dependent time series the shape of the limit distribution typically depends on unknown parameters. Therefore, we propose to use subsampling. We show the validity of subsampling for general…

Statistics Theory · Mathematics 2016-10-20 Annika Betken, Martin Wendler

Optimal Sub-sampling with Influence Functions

Sub-sampling is a common and often effective method to deal with the computational challenges of large datasets. However, for most statistical models, there is no well-motivated approach for drawing a non-uniform subsample. We show that the…

Machine Learning · Statistics 2017-09-07 Daniel Ting, Eric Brochu

Do We Really Sample Right In Model-Based Diagnosis?

Statistical samples, in order to be representative, have to be drawn from a population in a random and unbiased way. Nevertheless, it is common practice in the field of model-based diagnosis to make estimations from (biased) best-first…

Artificial Intelligence · Computer Science 2022-08-05 Patrick Rodler, Fatima Elichanova

A model robust sub-sampling approach for Generalised Linear Models in Big data settings

In today's modern era of Big data, computationally efficient and scalable methods are needed to support timely insights and informed decision making. One such method is sub-sampling, where a subset of the Big data is analysed and used as…

Methodology · Statistics 2022-09-07 Amalan Mahendran, Helen Thompson, James M. McGree

The Generic Holdout: Preventing False-Discoveries in Adaptive Data Science

Adaptive data analysis has posed a challenge to science due to its ability to generate false hypotheses on moderately large data sets. In general, with non-adaptive data analyses (where queries to the data are generated without being…

Methodology · Statistics 2018-09-18 Preetum Nakkiran, Jarosław Błasiok

D-optimal Subsampling Design for Massive Data Linear Regression

Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider multiple linear regression for an extraordinarily large number of…

Methodology · Statistics 2025-05-30 Torsten Glemser, Rainer Schwabe