Defect Prediction with Content-based Features

Software Engineering | 2024-09-30 | v1 | Computation and Language | Machine Learning

Abstract

Traditional defect prediction approaches often use metrics that measure the complexity of a software system's design or implementation, such as the number of lines of code in a source file. In this paper, we explore a different approach based on the content of source code. Our key assumption is that the source code of a software system contains information about its technical aspects, and that those aspects may differ in defect-proneness. Thus, content-based features such as words, topics, data types, and package names extracted from a source file could be used to predict its defects. We performed an extensive empirical evaluation and found that: i) such content-based features have higher predictive power than code complexity metrics, and ii) feature selection, reduction, and combination further improve prediction performance.
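
To make the idea of content-based features concrete, the following is a minimal sketch of how word and package-name features might be extracted from a Java source file. It is an illustration under stated assumptions, not the authors' implementation: the function names (`split_identifier`, `extract_features`), the `word:` and `pkg:` prefixes, and the regular expressions are all hypothetical, and the abstract does not say which language or tokenization the paper uses. Topic and data-type features are omitted.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration; the paper's extraction rules may differ.
# Matches explicit Java imports such as "import java.io.FileReader;".
# Wildcard imports ("import java.util.*;") are skipped by this simple pattern.
IMPORT_RE = re.compile(r"^\s*import\s+(?:static\s+)?([\w.]+)\s*;", re.MULTILINE)
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Splits camelCase, keeping acronym runs together ("HTTPServer" -> "HTTP", "Server").
CAMEL_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for part in identifier.split("_"):
        words.extend(w.lower() for w in CAMEL_RE.findall(part))
    return words


def extract_features(source):
    """Count 'word:<w>' and 'pkg:<p>' features in one source file.

    Language keywords are counted as words too; a real pipeline would
    likely filter them out or leave that to feature selection.
    """
    features = Counter()
    for imported in IMPORT_RE.findall(source):
        # Drop the class name to keep only the package part.
        package = imported.rsplit(".", 1)[0]
        features["pkg:" + package] += 1
    for identifier in IDENT_RE.findall(source):
        for word in split_identifier(identifier):
            features["word:" + word] += 1
    return features


if __name__ == "__main__":
    sample = "import java.io.FileReader;\nclass LogParser { void readFile() {} }"
    for name, count in sorted(extract_features(sample).items()):
        print(name, count)
```

The resulting per-file counts could then be fed to any standard classifier, with feature selection or dimensionality reduction applied beforehand, which is the kind of pipeline the abstract's second finding refers to.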

Cite

@article{arxiv.2409.18365,
  title  = {Defect Prediction with Content-based Features},
  author = {Hung Viet Pham and Tung Thanh Nguyen},
  journal= {arXiv preprint arXiv:2409.18365},
  year   = {2024}
}