Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size

Software Engineering 2021-08-11 v1

Abstract

This paper presents Megadiff, a dataset of source code diffs. It focuses on Java, with strict inclusion criteria based on commit message and diff size. Megadiff contains 663 029 Java diffs that can be used for research on commit comprehension, fault localization, automated program repair, and machine learning on code changes.
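As an illustration of what "diff size" can mean in practice, the sketch below counts the added and removed lines in a unified diff. This is only an assumption about how size might be measured; the abstract does not state the exact definition Megadiff uses, and the names `diff_size` and `SAMPLE` are hypothetical.

```python
import re

# Matches a unified-diff hunk header such as "@@ -1,3 +1,3 @@".
# A missing line count means 1, per the unified diff format.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def diff_size(unified_diff: str) -> int:
    """Count added plus removed lines in a unified diff (assumed size metric)."""
    size = 0
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in unified_diff.splitlines():
        if old_left == 0 and new_left == 0:
            # Outside a hunk: skip file headers until the next hunk header.
            m = HUNK_HEADER.match(line)
            if m:
                old_left = int(m.group(1)) if m.group(1) is not None else 1
                new_left = int(m.group(2)) if m.group(2) is not None else 1
            continue
        if line.startswith("+"):
            new_left -= 1
            size += 1
        elif line.startswith("-"):
            old_left -= 1
            size += 1
        elif line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        else:
            # Context line: present in both the old and the new file.
            old_left -= 1
            new_left -= 1
    return size


# Hypothetical one-line change to a Java file.
SAMPLE = "\n".join([
    "--- a/Foo.java",
    "+++ b/Foo.java",
    "@@ -1,3 +1,3 @@",
    " class Foo {",
    "-    int x = 0;",
    "+    int x = 1;",
    " }",
])

print(diff_size(SAMPLE))
```

Tracking the line counts from each hunk header, rather than only testing for a leading `+` or `-`, keeps file header lines (`---`, `+++`) from being counted as changes and keeps removed lines whose content itself begins with `--` from being mistaken for headers.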

Cite

@article{arxiv.2108.04631,
  title   = {Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size},
  author  = {Martin Monperrus and Matias Martinez and He Ye and Fernanda Madeiral and Thomas Durieux and Zhongxing Yu},
  journal = {arXiv preprint arXiv:2108.04631},
  year    = {2021}
}