Related papers: Preserving Statistical Validity in Adaptive Data A…

Preventing False Discovery in Interactive Data Analysis is Hard

We show that, under a standard hardness assumption, there is no computationally efficient algorithm that given $n$ samples from an unknown distribution can give valid answers to $n^{3+o(1)}$ adaptively chosen statistical queries. A…

Machine Learning · Computer Science 2014-08-08 Moritz Hardt , Jonathan Ullman

Generalization in Adaptive Data Analysis and Holdout Reuse

Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding this problem focus on statistical inference and generalization of individual analysis procedures. Yet the practice of data analysis…

Machine Learning · Computer Science 2015-09-28 Cynthia Dwork , Vitaly Feldman , Moritz Hardt , Toniann Pitassi , Omer Reingold , Aaron Roth

Valid inferential models for prediction in supervised learning problems

Prediction, where observed data is used to quantify uncertainty about a future observation, is a fundamental problem in statistics. Prediction sets with coverage probability guarantees are a common solution, but these do not provide…

Statistics Theory · Mathematics 2022-11-22 Leonardo Cella , Ryan Martin

Algorithmic Stability for Adaptive Data Analysis

Adaptivity is an important feature of data analysis---the choice of questions to ask about a dataset often depends on previous interactions with the same dataset. However, statistical validity is typically studied in a nonadaptive model,…

Machine Learning · Computer Science 2015-11-10 Raef Bassily , Kobbi Nissim , Adam Smith , Thomas Steinke , Uri Stemmer , Jonathan Ullman

A Theory of Statistical Inference for Ensuring the Robustness of Scientific Results

Inference is the process of using facts we know to learn about facts we do not know. A theory of inference gives assumptions necessary to get from the former to the latter, along with a definition for and summary of the resulting…

Machine Learning · Statistics 2021-09-27 Beau Coker , Cynthia Rudin , Gary King

Generalization for Adaptively-chosen Estimators via Stable Median

Datasets are often reused to perform multiple statistical analyses in an adaptive way, in which each analysis may depend on the outcomes of previous analyses on the same dataset. Standard statistical guarantees do not account for these…

Machine Learning · Computer Science 2017-06-19 Vitaly Feldman , Thomas Steinke

Information, Privacy and Stability in Adaptive Data Analysis

Traditional statistical theory assumes that the analysis to be performed on a given data set is selected independently of the data themselves. This assumption breaks downs when data are re-used across analyses and the analysis to be…

Machine Learning · Computer Science 2017-06-06 Adam Smith

Data Consistency Approach to Model Validation

In scientific inference problems, the underlying statistical modeling assumptions have a crucial impact on the end results. There exist, however, only a few automatic means for validating these fundamental modelling assumptions. The…

Methodology · Statistics 2019-05-21 Andreas Svensson , Dave Zachariah , Petre Stoica , Thomas B. Schön

Selective Randomization Inference for Adaptive Experiments

Adaptive experiments use preliminary analyses of the data to inform further course of action and are commonly used in many disciplines including medical and social sciences. Because the null hypothesis and experimental design are…

Methodology · Statistics 2026-05-26 Tobias Freidling , Qingyuan Zhao , Zijun Gao

Do We Really Even Need Data? A Modern Look at Drawing Inference with Predicted Data

As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from…

Machine Learning · Statistics 2025-12-08 Stephen Salerno , Kentaro Hoffman , Awan Afiaz , Anna Neufeld , Tyler H. McCormick , Jeffrey T. Leek

Controlling the False Discovery Proportion in Matched Observational Studies

We provide an approach to exploratory data analysis in matched observational studies with a single intervention and multiple endpoints. In such settings, the researcher would like to explore evidence for actual treatment effects among these…

Methodology · Statistics 2025-12-10 Mengqi Lin , Colin Fogarty

Semiparametric Efficient Inference in Adaptive Experiments

We consider the problem of efficient inference of the Average Treatment Effect in a sequential experiment where the policy governing the assignment of subjects to treatment or control can change over time. We first provide a central limit…

Machine Learning · Statistics 2024-03-05 Thomas Cook , Alan Mishler , Aaditya Ramdas

Statistical Inference with M-Estimators on Adaptively Collected Data

Bandit algorithms are increasingly used in real-world sequential decision-making problems. Associated with this is an increased desire to be able to use the resulting datasets to answer scientific questions like: Did one type of ad lead to…

Machine Learning · Computer Science 2021-11-23 Kelly W. Zhang , Lucas Janson , Susan A. Murphy

More General Queries and Less Generalization Error in Adaptive Data Analysis

Adaptivity is an important feature of data analysis---typically the choice of questions asked about a dataset depends on previous interactions with the same dataset. However, generalization error is typically bounded in a non-adaptive…

Machine Learning · Computer Science 2015-11-11 Raef Bassily , Adam Smith , Thomas Steinke , Jonathan Ullman

Robust Validation: Confident Predictions Even When Distributions Shift

While the traditional viewpoint in machine learning and statistics assumes training and testing samples come from the same population, practice belies this fiction. One strategy -- coming from robust statistics and optimization -- is thus…

Machine Learning · Statistics 2024-07-08 Maxime Cauchois , Suyash Gupta , Alnur Ali , John C. Duchi

Elements of Conformal Prediction for Statisticians

Predictive inference is a fundamental task in statistics, traditionally addressed using parametric assumptions about the data distribution and detailed analyses of how models learn from data. In recent years, conformal prediction has…

Methodology · Statistics 2026-03-26 Matteo Sesia , Stefano Favaro

Statistical Inference After Adaptive Sampling for Longitudinal Data

Online reinforcement learning and other adaptive sampling algorithms are increasingly used in digital intervention experiments to optimize treatment delivery for users over time. In this work, we focus on longitudinal user data collected by…

Machine Learning · Computer Science 2023-04-20 Kelly W. Zhang , Lucas Janson , Susan A. Murphy

A New Approach to Adaptive Data Analysis and Learning via Maximal Leakage

There is an increasing concern that most current published research findings are false. The main cause seems to lie in the fundamental disconnection between theory and practice in data analysis. While the former typically relies on…

Machine Learning · Statistics 2019-03-06 Amedeo Roberto Esposito , Michael Gastpar , Ibrahim Issa

The Generic Holdout: Preventing False-Discoveries in Adaptive Data Science

Adaptive data analysis has posed a challenge to science due to its ability to generate false hypotheses on moderately large data sets. In general, with non-adaptive data analyses (where queries to the data are generated without being…

Methodology · Statistics 2018-09-18 Preetum Nakkiran , Jarosław Błasiok

Maximum Fidelity

The most fundamental problem in statistics is the inference of an unknown probability distribution from a finite number of samples. For a specific observed data set, answers to the following questions would be desirable: (1) Estimation:…

Statistics Theory · Mathematics 2013-01-23 Ali Kinkhabwala