Related papers: Subsample-Based Estimation under Dynamic Contamina…
In this paper, we introduce a novel method to generate interpretable regression function estimators. The idea is based on called data-dependent coverings. The aim is to extract from the data a covering of the feature space instead of a…
Classical semiparametric inference with missing outcome data is not robust to contamination of the observed data and a single observation can have arbitrarily large influence on estimation of a parameter of interest. This sensitivity is…
While there is a rich literature on robust methodologies for contamination in continuously distributed data, contamination in categorical data is largely overlooked. This is regrettable because many datasets are categorical and oftentimes…
We study the problem of estimating causal effects under hidden confounding in the following unpaired data setting: we observe some covariates $X$ and an outcome $Y$ under different experimental conditions (environments) but do not observe…
This work is motivated by the problem of image mis-registration in remote sensing and we are interested in determining the resulting loss in the accuracy of pattern classification. A statistical formulation is given where we propose to use…
We investigate the performance of robust estimates of multivariate location under nonstandard data contamination models such as componentwise outliers (i.e., contamination in each variable is independent from the other variables). This…
The problem of model collapse has presented new challenges in iterative training of generative models, where such training with synthetic data leads to an overall degradation of performance. This paper looks at the problem from a…
Conformal prediction is a non-parametric technique for constructing prediction intervals or sets from arbitrary predictive models under the assumption that the data is exchangeable. It is popular as it comes with theoretical guarantees on…
Causal discovery can be a powerful tool for investigating causality when a system can be observed but is inaccessible to experiments in practice. Despite this, it is rarely used in any scientific or medical fields. One of the major hurdles…
We study the basic task of mean estimation in the presence of mean-shift contamination. In the mean-shift contamination model, an adversary is allowed to replace a small constant fraction of the clean samples by samples drawn from…
Additional training of a deep learning model can cause negative effects on the results, turning an initially positive sample into a negative one (degradation). Such degradation is possible in real-world use cases due to the diversity of…
The best subset selection (or "best subsets") estimator is a classic tool for sparse regression, and developments in mathematical optimization over the past decade have made it more computationally tractable than ever. Notwithstanding its…
Machine learning models are increasingly trained or fine-tuned on synthetic data. Recursively training on such data has been observed to significantly degrade performance in a wide range of tasks, often characterized by a progressive drift…
Subsampling methods have been recently proposed to speed up least squares estimation in large scale settings. However, these algorithms are typically not robust to outliers or corruptions in the observed covariates. The concept of influence…
In practice, the data distribution at test time often differs, to a smaller or larger extent, from that of the original training data. Consequentially, the so-called source classifier, trained on the available labelled data, deteriorates on…
When assessing spatially-extended complex systems, one can rarely sample the states of all components. We show that this spatial subsampling typically leads to severe underestimation of the risk of instability in systems with propagating…
We analyze the statistical consistency of robust estimators for precision matrices in high dimensions. We focus on a contamination mechanism acting cellwise on the data matrix. The estimators we analyze are formed by plugging appropriately…
Identifying anomalies and contamination in datasets is important in a wide variety of settings. In this paper, we describe a new technique for estimating contamination in large, discrete valued datasets. Our approach considers the normal…
Imputation methods play a critical role in enhancing the quality of practical time-series data, which often suffer from pervasive missing values. Recently, diffusion-based generative imputation methods have demonstrated remarkable success…
Distribution shift is an important concern in deep image classification, produced either by corruption of the source images, or a complete change, with the solution involving domain adaptation. While the primary goal is to improve accuracy…