Related papers: Characterizing instance hardness in classification…

Data Complexity: A New Perspective for Analyzing the Difficulty of Defect Prediction Tasks

Defect prediction is crucial for software quality assurance and has been extensively researched over recent decades. However, prior studies rarely focus on data complexity in defect prediction tasks, and even less on understanding the…

Software Engineering · Computer Science 2023-05-08 Xiaohui Wan, Zheng Zheng, Fangyun Qin, Xuhui Lu

PyHard: a novel tool for generating hardness embeddings to support data-centric analysis

For building successful Machine Learning (ML) systems, it is imperative to have high quality data and well tuned learning models. But how can one assess the quality of a given dataset? And how can the strengths and weaknesses of a model on…

Machine Learning · Computer Science 2021-09-30 Pedro Yuri Arbs Paiva, Kate Smith-Miles, Maria Gabriela Valeriano, Ana Carolina Lorena

Learning Rules-First Classifiers

Complex classifiers may exhibit "embarrassing" failures in cases where humans can easily provide a justified classification. Avoiding such failures is obviously of key importance. In this work, we focus on one such setting, where a label is…

Machine Learning · Computer Science 2019-06-14 Deborah Cohen, Amit Daniely, Amir Globerson, Gal Elidan

HardVis: Visual Analytics to Handle Instance Hardness Using Undersampling and Oversampling Techniques

Despite the tremendous advances in machine learning (ML), training with imbalanced data still poses challenges in many real-world applications. Among a series of diverse techniques to solve this problem, sampling algorithms are regarded as…

Machine Learning · Computer Science 2024-04-19 Angelos Chatzimparmpas, Fernando V. Paulovich, Andreas Kerren

Classifier Data Quality: A Geometric Complexity Based Method for Automated Baseline And Insights Generation

Testing Machine Learning (ML) models and AI-Infused Applications (AIIAs), or systems that contain ML models, is highly challenging. In addition to the challenges of testing classical software, it is acceptable and expected that statistical…

Machine Learning · Computer Science 2022-10-28 George Kour, Marcel Zalmanovici, Orna Raz, Samuel Ackerman, Ateret Anaby-Tavor

Towards Difficulty-Aware Analysis of Deep Neural Networks

Traditional instance-based model analysis focuses mainly on misclassified instances. However, this approach overlooks the varying difficulty associated with different instances. Ideally, a robust model should recognize and reflect the…

Human-Computer Interaction · Computer Science 2025-07-02 Linhao Meng, Stef van den Elzen, Anna Vilanova

Meta-Instance Selection. Instance Selection as a Classification Problem with Meta-Features

Data pruning, or instance selection, is an important problem in machine learning especially in terms of nearest neighbour classifier. However, in data pruning which speeds up the prediction phase, there is an issue related to the speed and…

Machine Learning · Computer Science 2025-01-22 Marcin Blachnik, Piotr Ciepliński

Exploring the Learning Difficulty of Data: Theory and Measure

As learning difficulty is crucial for machine learning (e.g., difficulty-based weighting learning strategies), previous literature has proposed a number of learning difficulty measures. However, no comprehensive investigation for learning…

Machine Learning · Computer Science 2022-09-20 Weiyao Zhu, Ou Wu, Fengguang Su, Yingjun Deng

Differences Between Hard and Noisy-labeled Samples: An Empirical Study

Extracting noisy or incorrectly labeled samples from a labeled dataset with hard/difficult samples is an important yet under-explored topic. Two general and often independent lines of work exist, one focuses on addressing noisy labels, and…

Machine Learning · Computer Science 2023-07-21 Mahsa Forouzesh, Patrick Thiran

Evolutionary Data Measures: Understanding the Difficulty of Text Classification Tasks

Classification tasks are usually analysed and improved through new model architectures or hyperparameter optimisation but the underlying properties of datasets are discovered on an ad-hoc basis as errors occur. However, understanding the…

Computation and Language · Computer Science 2018-12-10 Edward Collins, Nikolai Rozanov, Bingbing Zhang

Difficulty-Net: Learning to Predict Difficulty for Long-Tailed Recognition

Long-tailed datasets, where head classes comprise much more training samples than tail classes, cause recognition models to get biased towards the head classes. Weighted loss is one of the most popular ways of mitigating this issue, and a…

Computer Vision and Pattern Recognition · Computer Science 2022-09-08 Saptarshi Sinha, Hiroki Ohashi

Adversarial Examples and Metrics

Adversarial examples are a type of attack on machine learning (ML) systems which cause misclassification of inputs. Achieving robustness against adversarial examples is crucial to apply ML in the real world. While most prior work on…

Cryptography and Security · Computer Science 2020-07-16 Nico Döttling, Kathrin Grosse, Michael Backes, Ian Molloy

Beyond Hard Labels: Investigating data label distributions

High-quality data is a key aspect of modern machine learning. However, labels generated by humans suffer from issues like label noise and class ambiguities. We raise the question of whether hard labels are sufficient to represent the…

Computer Vision and Pattern Recognition · Computer Science 2022-10-07 Vasco Grossmann, Lars Schmarje, Reinhard Koch

Dataset Difficulty and the Role of Inductive Bias

Motivated by the goals of dataset pruning and defect identification, a growing body of methods have been developed to score individual examples within a dataset. These methods, which we call "example difficulty scores", are typically used…

Machine Learning · Computer Science 2024-01-04 Devin Kwok, Nikhil Anand, Jonathan Frankle, Gintare Karolina Dziugaite, David Rolnick

Deep Learning Through the Lens of Example Difficulty

Existing work on understanding deep learning often employs measures that compress all data-dependent information into a few numbers. In this work, we adopt a perspective based on the role of individual examples. We introduce a measure of…

Machine Learning · Computer Science 2021-06-21 Robert J. N. Baldock, Hartmut Maennel, Behnam Neyshabur

A Survey of Machine Learning Methods and Challenges for Windows Malware Classification

Malware classification is a difficult problem, to which machine learning methods have been applied for decades. Yet progress has often been slow, in part due to a number of unique difficulties with the task that occur through all stages of…

Cryptography and Security · Computer Science 2020-11-17 Edward Raff, Charles Nicholas

Certifying Data-Bias Robustness in Linear Regression

Datasets typically contain inaccuracies due to human error and societal biases, and these inaccuracies can affect the outcomes of models trained on such datasets. We present a technique for certifying whether linear regression models are…

Machine Learning · Computer Science 2022-06-09 Anna P. Meyer, Aws Albarghouthi, Loris D'Antoni

A Benchmark of Long-tailed Instance Segmentation with Noisy Labels

In this paper, we consider the instance segmentation task on a long-tailed dataset, which contains label noise, i.e., some of the annotations are incorrect. There are two main reasons making this case realistic. First, datasets collected…

Computer Vision and Pattern Recognition · Computer Science 2023-07-18 Guanlin Li, Guowen Xu, Tianwei Zhang

Identifying Mislabeled Instances in Classification Datasets

A key requirement for supervised machine learning is labeled training data, which is created by annotating unlabeled data with the appropriate class. Because this process can in many cases not be done by machines, labeling needs to be…

Machine Learning · Computer Science 2019-12-12 Nicolas Michael Müller, Karla Markert

Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance

In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy, underpinning a wide array of applications from neural architecture search to hyperparameter optimization. However, the reliability…

Machine Learning · Computer Science 2024-09-24 Pawel Pukowski, Haiping Lu