Related papers: Subsample-Based Estimation under Dynamic Contamina…

Consistent Regression using Data-Dependent Coverings

In this paper, we introduce a novel method to generate interpretable regression function estimators. The idea is based on so-called data-dependent coverings. The aim is to extract from the data a covering of the feature space instead of a…

Statistics Theory · Mathematics 2021-01-27 Vincent Margot, Jean-Patrick Baudry, Frédéric Guilloux, Olivier Wintenberger

Robust semiparametric inference with missing data

Classical semiparametric inference with missing outcome data is not robust to contamination of the observed data and a single observation can have arbitrarily large influence on estimation of a parameter of interest. This sensitivity is…

Methodology · Statistics 2021-03-02 Eva Cantoni, Xavier de Luna

Robust Estimation and Inference for Categorical Data

While there is a rich literature on robust methodologies for contamination in continuously distributed data, contamination in categorical data is largely overlooked. This is regrettable because many datasets are categorical and oftentimes…

Methodology · Statistics 2024-12-13 Max Welz

Many Experiments, Few Repetitions, Unpaired Data, and Sparse Effects: Is Causal Inference Possible?

We study the problem of estimating causal effects under hidden confounding in the following unpaired data setting: we observe some covariates $X$ and an outcome $Y$ under different experimental conditions (environments) but do not observe…

Machine Learning · Statistics 2026-01-22 Felix Schur, Niklas Pfister, Peng Ding, Sach Mukherjee, Jonas Peters

Classification under Data Contamination with Application to Remote Sensing Image Mis-registration

This work is motivated by the problem of image mis-registration in remote sensing and we are interested in determining the resulting loss in the accuracy of pattern classification. A statistical formulation is given where we propose to use…

Methodology · Statistics 2015-03-17 Donghui Yan, Peng Gong, Aiyou Chen, Liheng Zhong

Propagation of outliers in multivariate data

We investigate the performance of robust estimates of multivariate location under nonstandard data contamination models such as componentwise outliers (i.e., contamination in each variable is independent from the other variables). This…

Statistics Theory · Mathematics 2009-03-04 Fatemah Alqallaf, Stefan Van Aelst, Victor J. Yohai, Ruben H. Zamar

From Collapse to Improvement: Statistical Perspectives on the Evolutionary Dynamics of Iterative Training on Contaminated Sources

The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a…

Machine Learning · Statistics 2026-02-19 Soham Bakshi, Sunrit Chakraborty

Split Conformal Prediction under Data Contamination

Conformal prediction is a non-parametric technique for constructing prediction intervals or sets from arbitrary predictive models under the assumption that the data is exchangeable. It is popular as it comes with theoretical guarantees on…

Machine Learning · Statistics 2025-12-01 Jase Clarkson, Wenkai Xu, Mihai Cucuringu, Yvik Swan, Gesine Reinert

Simulations evaluating resampling methods for causal discovery: ensemble performance and calibration

Causal discovery can be a powerful tool for investigating causality when a system can be observed but is inaccessible to experiments in practice. Despite this, it is rarely used in any scientific or medical fields. One of the major hurdles…

Machine Learning · Statistics 2019-10-07 Erich Kummerfeld, Alexander Rix

Sample Complexity Bounds for Robust Mean Estimation with Mean-Shift Contamination

We study the basic task of mean estimation in the presence of mean-shift contamination. In the mean-shift contamination model, an adversary is allowed to replace a small constant fraction of the clean samples by samples drawn from…

Machine Learning · Computer Science 2026-02-27 Ilias Diakonikolas, Giannis Iakovidis, Daniel M. Kane, Sihan Liu

Practical Insights of Repairing Model Problems on Image Classification

Additional training of a deep learning model can cause negative effects on the results, turning an initially positive sample into a negative one (degradation). Such degradation is possible in real-world use cases due to the diversity of…

Machine Learning · Computer Science 2022-05-19 Akihito Yoshii, Susumu Tokumoto, Fuyuki Ishikawa

Robust subset selection

The best subset selection (or "best subsets") estimator is a classic tool for sparse regression, and developments in mathematical optimization over the past decade have made it more computationally tractable than ever. Notwithstanding its…

Methodology · Statistics 2022-01-11 Ryan Thompson

Error Propagation and Model Collapse in Diffusion Models: A Theoretical Study

Machine learning models are increasingly trained or fine-tuned on synthetic data. Recursively training on such data has been observed to significantly degrade performance in a wide range of tasks, often characterized by a progressive drift…

Machine Learning · Statistics 2026-02-19 Nail B. Khelifa, Richard E. Turner, Ramji Venkataramanan

Fast and Robust Least Squares Estimation in Corrupted Linear Models

Subsampling methods have been recently proposed to speed up least squares estimation in large scale settings. However, these algorithms are typically not robust to outliers or corruptions in the observed covariates. The concept of influence…

Machine Learning · Statistics 2014-06-20 Brian McWilliams, Gabriel Krummenacher, Mario Lucic, Joachim M. Buhmann

Target Robust Discriminant Analysis

In practice, the data distribution at test time often differs, to a smaller or larger extent, from that of the original training data. Consequently, the so-called source classifier, trained on the available labelled data, deteriorates on…

Machine Learning · Statistics 2021-06-18 Wouter M. Kouw, Marco Loog

Inferring collective dynamical states from widely unobserved systems

When assessing spatially-extended complex systems, one can rarely sample the states of all components. We show that this spatial subsampling typically leads to severe underestimation of the risk of instability in systems with propagating…

Data Analysis, Statistics and Probability · Physics 2018-07-06 Jens Wilting, Viola Priesemann

High-dimensional robust precision matrix estimation: Cellwise corruption under $\epsilon$-contamination

We analyze the statistical consistency of robust estimators for precision matrices in high dimensions. We focus on a contamination mechanism acting cellwise on the data matrix. The estimators we analyze are formed by plugging appropriately…

Statistics Theory · Mathematics 2015-09-25 Po-Ling Loh, Xin Lu Tan

Contamination Estimation via Convex Relaxations

Identifying anomalies and contamination in datasets is important in a wide variety of settings. In this paper, we describe a new technique for estimating contamination in large, discrete valued datasets. Our approach considers the normal…

Information Theory · Computer Science 2015-06-16 Matthew L. Malloy, Scott Alfeld, Paul Barford

Diffusion Transformers for Imputation: Statistical Efficiency and Uncertainty Quantification

Imputation methods play a critical role in enhancing the quality of practical time-series data, which often suffer from pervasive missing values. Recently, diffusion-based generative imputation methods have demonstrated remarkable success…

Machine Learning · Computer Science 2025-10-03 Zeqi Ye, Minshuo Chen

Frustratingly Easy Uncertainty Estimation for Distribution Shift

Distribution shift is an important concern in deep image classification, produced either by corruption of the source images, or a complete change, with the solution involving domain adaptation. While the primary goal is to improve accuracy…

Machine Learning · Statistics 2021-10-19 Tiago Salvador, Vikram Voleti, Alexander Iannantuono, Adam Oberman