Related papers: CoDesc: A Large Code-Description Parallel Dataset

Recommendations for Datasets for Source Code Summarization

Source Code Summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation e.g. the one-sentence Java method descriptions in JavaDocs. Code…

Computation and Language · Computer Science 2019-04-05 Alexander LeClair , Collin McMillan

Using Document Similarity Methods to create Parallel Datasets for Code Translation

Translating source code from one programming language to another is a critical, time-consuming task in modernizing legacy applications and codebases. Recent work in this space has drawn inspiration from the software naturalness hypothesis…

Computation and Language · Computer Science 2021-10-12 Mayank Agarwal , Kartik Talamadupula , Fernando Martinez , Stephanie Houde , Michael Muller , John Richards , Steven I Ross , Justin D. Weisz

Constructing Multilingual Code Search Dataset Using Neural Machine Translation

Code search is a task to find programming codes that semantically match the given natural language queries. Even though some of the existing datasets for this task are multilingual on the programming language side, their query data are only…

Computation and Language · Computer Science 2023-06-28 Ryo Sekizawa , Nan Duan , Shuai Lu , Hitomi Yanaka

CodeDSI: Differentiable Code Search

Reimplementing solutions to previously solved software engineering problems is not only inefficient but also introduces inadequate and error-prone code. Many existing methods achieve impressive performance on this issue by using…

Software Engineering · Computer Science 2022-10-04 Usama Nadeem , Noah Ziems , Shaoen Wu

Code2Doc: A Quality-First Curated Dataset for Code Documentation

The performance of automatic code documentation generation models depends critically on the quality of the training data used for supervision. However, most existing code documentation datasets are constructed through large scale scraping…

Software Engineering · Computer Science 2025-12-25 Recep Kaan Karaman , Meftun Akarsu

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly…

Machine Learning · Computer Science 2020-06-09 Hamel Husain , Ho-Hsiang Wu , Tiferet Gazit , Miltiadis Allamanis , Marc Brockschmidt

CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase…

Software Engineering · Computer Science 2021-08-31 Ruchir Puri , David S. Kung , Geert Janssen , Wei Zhang , Giacomo Domeniconi , Vladimir Zolotov , Julian Dolby , Jie Chen , Mihir Choudhury , Lindsey Decker , Veronika Thost , Luca Buratti , Saurabh Pujar , Shyam Ramji , Ulrich Finkler , Susan Malaika , Frederick Reiss

Neural Code Search Evaluation Dataset

There has been an increase of interest in code search using natural language. Assessing the performance of such code search models can be difficult without a readily available evaluation suite. In this paper, we present an evaluation…

Software Engineering · Computer Science 2019-10-03 Hongyu Li , Seohyun Kim , Satish Chandra

CodeSum: Translate Program Language to Natural Language

During software maintenance, programmers spend a lot of time on code comprehension. Reading comments is an effective way for programmers to reduce the reading and navigating time when comprehending source code. Therefore, as a critical task…

Software Engineering · Computer Science 2018-02-01 Xing Hu , Yuhan Wei , Ge Li , Zhi Jin

On the Importance of Building High-quality Training Datasets for Neural Code Search

The performance of neural code search is significantly influenced by the quality of the training data from which the neural models are derived. A large corpus of high-quality query and code pairs is demanded to establish a precise mapping…

Software Engineering · Computer Science 2022-02-15 Zhensu Sun , Li Li , Yan Liu , Xiaoning Du , Li Li

XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence

Recent advances in machine learning have significantly improved the understanding of source code data and achieved good performance on a number of downstream tasks. Open source repositories like GitHub enable this process with rich…

Software Engineering · Computer Science 2022-06-20 Ming Zhu , Aneesh Jain , Karthik Suresh , Roshan Ravindran , Sindhu Tipirneni , Chandan K. Reddy

MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages

While there has been a recent burgeoning of applications at the intersection of natural and programming languages, such as code generation and code summarization, these applications are usually English-centric. This creates a barrier for…

Computation and Language · Computer Science 2023-02-08 Zhiruo Wang , Grace Cuenca , Shuyan Zhou , Frank F. Xu , Graham Neubig

Source Code Retrieval Using Sequence Based Similarity

Duplicated code has a negative impact on the quality of software systems and should be detected at least. In this paper, we discuss an approach that improves source code retrieval using the structural information about the programs. We…

Software Engineering · Computer Science 2013-08-19 Yoshihisa Udagawa

CoSQA: 20,000+ Web Queries for Code Search and Question Answering

Finding codes given natural language query is beneficial to the productivity of software developers. Future progress towards better semantic matching between query and code requires richer supervised training resources. To remedy this, we…

Computation and Language · Computer Science 2021-05-28 Junjie Huang , Duyu Tang , Linjun Shou , Ming Gong , Ke Xu , Daxin Jiang , Ming Zhou , Nan Duan

Cross-Language Code Search using Static and Dynamic Analyses

As code search permeates most activities in software development, code-to-code search has emerged to support using code as a query and retrieving similar code in the search results. Applications include duplicate code detection for…

Software Engineering · Computer Science 2021-06-18 George Mathew , Kathryn T. Stolee

GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding

Language models can serve as a valuable tool for software developers to increase productivity. Large generative models can be used for code generation and code completion, while smaller encoder-only models are capable of performing code…

Computation and Language · Computer Science 2023-11-17 Andor Diera , Abdelhalim Dahou , Lukas Galke , Fabian Karl , Florian Sihler , Ansgar Scherp

COSET: A Benchmark for Evaluating Neural Program Embeddings

Neural program embedding can be helpful in analyzing large software, a task that is challenging for traditional logic-based program analyses due to their limited scalability. A key focus of recent machine-learning advances in this area is…

Machine Learning · Computer Science 2019-05-29 Ke Wang , Mihai Christodorescu

CodeShell Technical Report

Code large language models mark a pivotal breakthrough in artificial intelligence. They are specifically crafted to understand and generate programming languages, significantly boosting the efficiency of coding development workflows. In…

Software Engineering · Computer Science 2024-03-26 Rui Xie , Zhengran Zeng , Zhuohao Yu , Chang Gao , Shikun Zhang , Wei Ye

Synchromesh: Reliable code generation from pre-trained language models

Large pre-trained language models have been used to generate code, providing a flexible interface for synthesizing programs from natural language specifications. However, they often violate syntactic and semantic rules of their output…

Machine Learning · Computer Science 2022-01-28 Gabriel Poesia , Oleksandr Polozov , Vu Le , Ashish Tiwari , Gustavo Soares , Christopher Meek , Sumit Gulwani

Learning and Evaluating Contextual Embedding of Source Code

Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come…

Software Engineering · Computer Science 2020-08-19 Aditya Kanade , Petros Maniatis , Gogul Balakrishnan , Kensen Shi