Clone Detection on Large Scala Codebases

Wahidur Rahman; Yisen Xu; Fan Pu; Jifeng Xuan; Xiangyang Jia; Michail Basios; Leslie Kanthan; Lingbo Li; Fan Wu; Baowen Xu

Software Engineering 2022-04-12 v1

Abstract

Code clones are identical or similar code segments. Their widespread presence can increase maintenance costs and jeopardise software quality. The research community has developed many techniques to detect code clones; however, there is little evidence of how these techniques perform in industrial use cases. In this paper, we aim to uncover the differences that arise when such techniques are applied in industrial use cases. We conducted a large-scale experimental study of the performance of two state-of-the-art code clone detection techniques, SourcererCC and AutoenCODE, on both open source projects and an industrial project written in the Scala language. Our results reveal that both algorithms perform differently on the industrial project than on the open source projects, with the largest drop in precision being 30.7% and the largest increase in recall being 32.4%. By having the developers of the industrial project manually label samples of it, we discovered that it contains substantially fewer Type-3 clones than the open source projects.

Keywords

software refactoring, code generation, binary analysis

Cite

@article{arxiv.2204.04247,
  title  = {Clone Detection on Large Scala Codebases},
  author = {Wahidur Rahman and Yisen Xu and Fan Pu and Jifeng Xuan and Xiangyang Jia and Michail Basios and Leslie Kanthan and Lingbo Li and Fan Wu and Baowen Xu},
  journal= {arXiv preprint arXiv:2204.04247},
  year   = {2022}
}

Comments

Presented at IWSC SANER 2020
