A Language-Agnostic Model for Semantic Source Code Labeling

Ben Gelman; Bryan Hoyle; Jessica Moore; Joshua Saxe; David Slater

doi:10.1145/3243127.3243132

Machine Learning | 2019-06-05 | v1 | Computation and Language, Software Engineering, Machine Learning

Abstract

Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under the ROC curve of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from GitHub, and we obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.
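The two metrics quoted in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' evaluation code: the function names, the toy data, and the pairwise formulation of AUC are assumptions made for illustration. It also assumes ground-truth tags are available for every snippet, whereas the paper's top-1 figure on GitHub documents comes from manual validation.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC for one tag: the fraction of (positive, negative) pairs in which
    the positive snippet gets the higher score; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(label_matrix: Matrix, score_matrix: Matrix) -> float:
    """Unweighted mean of per-tag AUCs; rows are snippets, columns are tags."""
    n_tags = len(label_matrix[0])
    per_tag = [
        roc_auc([row[t] for row in label_matrix],
                [row[t] for row in score_matrix])
        for t in range(n_tags)
    ]
    return sum(per_tag) / n_tags


def top1_accuracy(label_matrix: Matrix, score_matrix: Matrix) -> float:
    """Fraction of snippets whose highest-scoring tag is a correct tag."""
    hits = 0
    for labels, scores in zip(label_matrix, score_matrix):
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += labels[best]
    return hits / len(label_matrix)


# Hypothetical toy data: 4 snippets, 3 tags (1 = tag applies).
TOY_LABELS = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
TOY_SCORES = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.2],
              [0.25, 0.7, 0.1], [0.5, 0.1, 0.4]]

if __name__ == "__main__":
    print("mean AUC:", mean_auc(TOY_LABELS, TOY_SCORES))
    print("top-1 accuracy:", top1_accuracy(TOY_LABELS, TOY_SCORES))
```

Averaging per-tag AUCs without weighting gives rare tags the same influence as common ones, which matters for a long-tailed tag list; whether the paper weights tags this way is not stated in the abstract.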

Keywords

code generation, software refactoring, semi-supervised learning

Cite

@article{arxiv.1906.01032,
  title  = {A Language-Agnostic Model for Semantic Source Code Labeling},
  author = {Ben Gelman and Bryan Hoyle and Jessica Moore and Joshua Saxe and David Slater},
  journal= {arXiv preprint arXiv:1906.01032},
  year   = {2019}
}

Comments

MASES 2018 Publication
