Related papers: Training Set Debugging Using Trusted Items

Provable Training Set Debugging for Linear Regression

We investigate problems in penalized $M$-estimation, inspired by applications in machine learning debugging. Data are collected from two pools, one containing data with possibly contaminated labels, and the other which is known to contain…

Machine Learning · Computer Science 2021-08-11 Xiaomin Zhang, Xiaojin Zhu, Po-Ling Loh

Towards Training Set Reduction for Bug Triage

Bug triage is an important step in the process of bug fixing. The goal of bug triage is to assign a new-coming bug to the correct potential developer. The existing bug triage approaches are based on machine learning algorithms, which build…

Software Engineering · Computer Science 2017-03-14 Weiqin Zou, Yan Hu, Jifeng Xuan, He Jiang

Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise

The growing importance of massive datasets used for deep learning makes robustness to label noise a critical property for classifiers to have. Sources of label noise include automatic labeling, non-expert labeling, and label corruption by…

Machine Learning · Computer Science 2019-01-30 Dan Hendrycks, Mantas Mazeika, Duncan Wilson, Kevin Gimpel

Learning with Bad Training Data via Iterative Trimmed Loss Minimization

In this paper, we study a simple and generic framework to tackle the problem of learning model parameters when a fraction of the training samples are corrupted. We first make a simple observation: in a variety of such settings, the…

Machine Learning · Computer Science 2019-02-20 Yanyao Shen, Sujay Sanghavi

Robust Federated Training via Collaborative Machine Teaching using Trusted Instances

Federated learning performs distributed model training using local data hosted by agents. It shares only model parameter updates for iterative aggregation at the server. Although it is privacy-preserving by design, federated learning is…

Machine Learning · Computer Science 2019-05-09 Yufei Han, Xiangliang Zhang

Modelling Concurrency Bugs Using Machine Learning

Artificial Intelligence has gained a lot of traction in the recent years, with machine learning notably starting to see more applications across a varied range of fields. One specific machine learning application that is of interest to us…

Software Engineering · Computer Science 2023-05-10 Teodor Rares Begu

Certifying Data-Bias Robustness in Linear Regression

Datasets typically contain inaccuracies due to human error and societal biases, and these inaccuracies can affect the outcomes of models trained on such datasets. We present a technique for certifying whether linear regression models are…

Machine Learning · Computer Science 2022-06-09 Anna P. Meyer, Aws Albarghouthi, Loris D'Antoni

Benchmarking Machine Learning Technologies for Software Defect Detection

Machine Learning approaches are good in solving problems that have less information. In most cases, the software domain problems characterize as a process of learning that depend on the various circumstances and changes accordingly. A…

Software Engineering · Computer Science 2015-06-26 Saiqa Aleem, Luiz Fernando Capretz, Faheem Ahmed

Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks

We identify label errors in the test sets of 10 of the most commonly-used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets…

Machine Learning · Statistics 2021-11-09 Curtis G. Northcutt, Anish Athalye, Jonas Mueller

Towards Testing of Deep Learning Systems with Training Set Reduction

Testing the implementation of deep learning systems and their training routines is crucial to maintain a reliable code base. Modern software development employs processes, such as Continuous Integration, in which changes to the software are…

Machine Learning · Statistics 2019-01-15 Helge Spieker, Arnaud Gotlieb

Training Set Camouflage

We introduce a form of steganography in the domain of machine learning which we call training set camouflage. Imagine Alice has a training set on an illicit machine learning classification task. Alice wants Bob (a machine learning system)…

Cryptography and Security · Computer Science 2018-12-17 Ayon Sen, Scott Alfeld, Xuezhou Zhang, Ara Vartanian, Yuzhe Ma, Xiaojin Zhu

Improved Training for Self-Training by Confidence Assessments

It is well known that for some tasks, labeled data sets may be hard to gather. Therefore, we wished to tackle here the problem of having insufficient training data. We examined learning methods from unlabeled data after an initial training…

Machine Learning · Computer Science 2018-04-06 Gal Hyams, Daniel Greenfeld, Dor Bank

Deep Learning for Bug-Localization in Student Programs

Providing feedback is an integral part of teaching. Most open online courses on programming make use of automated grading systems to support programming assignments and give real-time feedback. These systems usually rely on test results to…

Software Engineering · Computer Science 2019-05-30 Rahul Gupta, Aditya Kanade, Shirish Shevade

Data Programming: Creating Large Training Sets, Quickly

Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive…

Machine Learning · Statistics 2018-12-10 Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher Ré

Classifier-Guided Visual Correction of Noisy Labels for Image Classification Tasks

Training data plays an essential role in modern applications of machine learning. However, gathering labeled training data is time-consuming. Therefore, labeling is often outsourced to less experienced users, or completely automated. This…

Computer Vision and Pattern Recognition · Computer Science 2020-06-11 Alex Bäuerle, Heiko Neumann, Timo Ropinski

Set-Based Training for Neural Network Verification

Neural networks are vulnerable to adversarial attacks, i.e., small input perturbations can significantly affect the outputs of a neural network. Therefore, to ensure safety of neural networks in safety-critical environments, the robustness…

Machine Learning · Computer Science 2025-08-06 Lukas Koller, Tobias Ladner, Matthias Althoff

Error-Bounded Correction of Noisy Labels

To collect large scale annotated data, it is inevitable to introduce label noise, i.e., incorrect class labels. To be robust against label noise, many successful methods rely on the noisy classifiers (i.e., models trained on the noisy…

Computer Vision and Pattern Recognition · Computer Science 2020-11-23 Songzhu Zheng, Pengxiang Wu, Aman Goswami, Mayank Goswami, Dimitris Metaxas, Chao Chen

A Bug or a Suggestion? An Automatic Way to Label Issues

More and more users and developers are using Issue Tracking Systems (ITSs) to report issues, including bugs, feature requests, enhancement suggestions, etc. Different information, however, is gathered from users when issues are reported on…

Software Engineering · Computer Science 2019-09-04 Yuxiang Zhu, Minxue Pan, Yu Pei, Tian Zhang

Enhancing Self-Training Methods

Semi-supervised learning approaches train on small sets of labeled data along with large sets of unlabeled data. Self-training is a semi-supervised teacher-student approach that often suffers from the problem of "confirmation bias" that…

Machine Learning · Computer Science 2023-01-19 Aswathnarayan Radhakrishnan, Jim Davis, Zachary Rabin, Benjamin Lewis, Matthew Scherreik, Roman Ilin

Machine Learning from Explanations

Acquiring and training on large-scale labeled data can be impractical due to cost constraints. Additionally, the use of small training datasets can result in considerable variability in model outcomes, overfitting, and learning of spurious…

Machine Learning · Computer Science 2025-07-08 Jiashu Tao, Reza Shokri