Related papers: Provable Training Set Debugging for Linear Regress…
Training set bugs are flaws in the data that adversely affect machine learning. The training set is usually too large for man- ual inspection, but one may have the resources to verify a few trusted items. The set of trusted items may not by…
Datasets typically contain inaccuracies due to human error and societal biases, and these inaccuracies can affect the outcomes of models trained on such datasets. We present a technique for certifying whether linear regression models are…
Motivated by value function estimation in reinforcement learning, we study statistical linear inverse problems, i.e., problems where the coefficients of a linear system to be solved are observed in noise. We consider penalized estimators,…
In a binary classification problem where the goal is to fit an accurate predictor, the presence of corrupted labels in the training data set may create an additional challenge. However, in settings where likelihood maximization is poorly…
Large language models (LLMs) are increasingly exposed to data contamination, i.e., performance gains driven by prior exposure of test datasets rather than generalization. However, in the context of tabular data, this problem is largely…
Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for…
Data debugging is to find a subset of the training data such that the model obtained by retraining on the subset has a better accuracy. A bunch of heuristic approaches are proposed, however, none of them are guaranteed to solve this problem…
Noisy training set usually leads to the degradation of generalization and robustness of neural networks. In this paper, we propose using a theoretically guaranteed noisy label detection framework to detect and remove noisy data for Learning…
We study binary classification in the setting where the learner is presented with multiple corrupted training samples, with possibly different sample sizes and degrees of corruption, and introduce an approach based on minimizing a weighted…
We investigate whether post-hoc model explanations are effective for diagnosing model errors--model debugging. In response to the challenge of explaining a model's prediction, a vast array of explanation methods have been proposed. Despite…
Deep Neural Networks are well known for efficiently fitting training data, yet experiencing poor generalization capabilities whenever some kind of bias dominates over the actual task labels, resulting in models learning "shortcuts". In…
Machine learning-based program analyses have recently shown the promise of integrating formal and probabilistic reasoning towards aiding software development. However, in the absence of large annotated corpora, training these analyses is…
A novel correction algorithm is proposed for multi-class classification problems with corrupted training data. The algorithm is non-intrusive, in the sense that it post-processes a trained classification model by adding a correction…
In this paper, we study a simple and generic framework to tackle the problem of learning model parameters when a fraction of the training samples are corrupted. We first make a simple observation: in a variety of such settings, the…
To collect large scale annotated data, it is inevitable to introduce label noise, i.e., incorrect class labels. To be robust against label noise, many successful methods rely on the noisy classifiers (i.e., models trained on the noisy…
The perturbation analysis of linear solvers applied to systems arising broadly in machine learning settings -- for instance, when using linear regression models -- establishes an important perspective when reframing these analyses through…
Regularized linear regression under the $\ell_1$ penalty, such as the Lasso, has been shown to be effective in variable selection and sparse modeling. The sampling distribution of an $\ell_1$-penalized estimator $\hat{\beta}$ is hard to…
We typically compute aggregate statistics on held-out test data to assess the generalization of machine learning models. However, statistics on test data often overstate model generalization, and thus, the performance of deployed machine…
In many applications labeled data is not readily available, and needs to be collected via pain-staking human supervision. We propose a rule-exemplar method for collecting human supervision to combine the efficiency of rules with the quality…
Penalized regression methods, most notably the lasso, are a popular approach to analyzing high-dimensional data. An attractive property of the lasso is that it naturally performs variable selection. An important area of concern, however, is…