Related papers: Data Quality for Software Vulnerability Datasets

Data Quality Issues in Vulnerability Detection Datasets

Vulnerability detection is a crucial yet challenging task to identify potential weaknesses in software for cyber security. Recently, deep learning (DL) has made great progress in automating the detection process. Due to the complex…

Cryptography and Security · Computer Science 2024-10-10 Yuejun Guo, Seifeddine Bettaieb

R+R: Security Vulnerability Dataset Quality Is Critical

Large Language Models (LLMs) are of great interest in vulnerability detection and repair. The effectiveness of these models hinges on the quality of the datasets used for both training and evaluation. Our investigation reveals that a number…

Software Engineering · Computer Science 2025-03-11 Anurag Swarnim Yadav, Joseph N. Wilson

How Data Quality Affects Machine Learning Models for Credit Risk Assessment

Machine Learning (ML) models are being increasingly employed for credit risk evaluation, with their effectiveness largely hinging on the quality of the input data. In this paper we investigate the impact of several data quality issues,…

Machine Learning · Computer Science 2025-11-18 Andrea Maurino

Noisy Label Learning for Security Defects

Data-driven software engineering processes, such as vulnerability prediction, heavily rely on the quality of the data used. In this paper, we observe that it is infeasible to obtain a noise-free security defect dataset in practice. Despite…

Software Engineering · Computer Science 2022-04-04 Roland Croft, M. Ali Babar, Huaming Chen

Experience: Quality Benchmarking of Datasets Used in Software Effort Estimation

Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous process and project management activities, including the estimation of development effort and the prediction of the likely location…

Software Engineering · Computer Science 2020-12-22 Michael F. Bosu, Stephen G. MacDonell

Data Quality in Empirical Software Engineering: A Targeted Review

Context: The utility of prediction models in empirical software engineering (ESE) is heavily reliant on the quality of the data used in building those models. Several data quality challenges such as noise, incompleteness, outliers and…

Software Engineering · Computer Science 2021-05-25 Michael Franklin Bosu, Stephen G. MacDonell

When Machine Learning Meets Vulnerability Discovery: Challenges and Lessons Learned

In recent years, machine learning has demonstrated impressive results in various fields, including software vulnerability detection. Nonetheless, using machine learning to identify software vulnerabilities presents new challenges,…

Cryptography and Security · Computer Science 2025-08-22 Sima Arasteh, Christophe Hauser

Deep Learning based Vulnerability Detection: Are We There Yet?

Automated detection of software vulnerabilities is a fundamental problem in software security. Existing program analysis techniques either suffer from high false positives or false negatives. Recent progress in Deep Learning (DL) has…

Software Engineering · Computer Science 2020-09-16 Saikat Chakraborty, Rahul Krishna, Yangruibo Ding, Baishakhi Ray

Statistical Dataset Evaluation: Reliability, Difficulty, and Validity

Datasets serve as crucial training resources and model performance trackers. However, existing datasets have exposed a plethora of problems, inducing biased models and unreliable evaluation results. In this paper, we propose a…

Computation and Language · Computer Science 2022-12-20 Chengwen Wang, Qingxiu Dong, Xiaochen Wang, Haitao Wang, Zhifang Sui

A Taxonomy of Data Quality Challenges in Empirical Software Engineering

Reliable empirical models such as those used in software effort estimation or defect prediction are inherently dependent on the data from which they are built. As demands for process and product improvement continue to grow, the quality of…

Software Engineering · Computer Science 2021-06-14 Michael Franklin Bosu, Stephen G. MacDonell

Automatic Data Labeling for Software Vulnerability Prediction Models: How Far Are We?

Background: Software Vulnerability (SV) prediction needs large-sized and high-quality data to perform well. Current SV datasets mostly require expensive labeling efforts by experts (human-labeled) and thus are limited in size. Meanwhile,…

Software Engineering · Computer Science 2024-07-26 Triet H. M. Le, M. Ali Babar

Benchmarking Software Vulnerability Detection Techniques: A Survey

Software vulnerabilities can have serious consequences, which is why many techniques have been proposed to defend against them. Among these, vulnerability detection techniques are a major area of focus. However, there is a lack of a…

Software Engineering · Computer Science 2023-03-30 Yingzhou Bi, Jiangtao Huang, Penghui Liu, Lianmei Wang

An extensive empirical study of inconsistent labels in multi-version-project defect data sets

The label quality of defect data sets has a direct influence on the reliability of defect prediction models. In this study, for multi-version-project defect data sets, we propose an approach to automatically detecting instances with…

Software Engineering · Computer Science 2021-01-29 Shiran Liu, Zhaoqiang Guo, Yanhui Li, Chuanqi Wang, Lin Chen, Zhongbin Sun, Yuming Zhou

Towards Understanding the Impact of Data Bugs on Deep Learning Models in Software Engineering

Deep learning (DL) techniques have achieved significant success in various software engineering tasks (e.g., code completion by Copilot). However, DL systems are prone to bugs from many sources, including training data. Existing literature…

Software Engineering · Computer Science 2025-08-12 Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh

An Investigation into Inconsistency of Software Vulnerability Severity across Data Sources

Software Vulnerability (SV) severity assessment is a vital task for informing SV remediation and triage. Ranking of SV severity scores is often used to advise prioritization of patching efforts. However, severity assessment is a difficult…

Software Engineering · Computer Science 2022-01-19 Roland Croft, M. Ali Babar, Li Li

A Survey on Automated Software Vulnerability Detection Using Machine Learning and Deep Learning

Software vulnerability detection is critical in software security because it identifies potential bugs in software systems, enabling immediate remediation and mitigation measures to be implemented before they may be exploited. Automatic…

Software Engineering · Computer Science 2023-06-21 Nima Shiri Harzevili, Alvine Boaye Belle, Junjie Wang, Song Wang, Zhen Ming (Jack) Jiang, Nachiappan Nagappan

Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets

The impact of software vulnerabilities on everyday software systems is significant. Despite deep learning models being proposed for vulnerability detection, their reliability is questionable. Prior evaluations show high recall/F1 scores of…

Software Engineering · Computer Science 2024-07-04 Partha Chakraborty, Krishna Kanth Arumugam, Mahmoud Alfadel, Meiyappan Nagappan, Shane McIntosh

Data and Context Matter: Towards Generalizing AI-based Software Vulnerability Detection

AI-based solutions demonstrate remarkable results in identifying vulnerabilities in software, but research has consistently found that this performance does not generalize to unseen codebases. In this paper, we specifically investigate the…

Cryptography and Security · Computer Science 2025-10-08 Rijha Safdar, Danyail Mateen, Syed Taha Ali, M. Umer Ashfaq, Wajahat Hussain

Vulnerability Detection with Code Language Models: How Far Are We?

In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing…

Software Engineering · Computer Science 2024-07-11 Yangruibo Ding, Yanjun Fu, Omniyyah Ibrahim, Chawin Sitawarin, Xinyun Chen, Basel Alomair, David Wagner, Baishakhi Ray, Yizheng Chen

An Empirical Study of the Imbalance Issue in Software Vulnerability Detection

Vulnerability detection is crucial to protect software security. Nowadays, deep learning (DL) is the most promising technique to automate this detection task, leveraging its superior ability to extract patterns and representations within…

Software Engineering · Computer Science 2026-02-13 Yuejun Guo, Qiang Hu, Qiang Tang, Yves Le Traon