Related papers: A parallel corpus of Python functions and document…

DocGen: Generating Detailed Parameter Docstrings in Python

Documentation debt hinders the effective utilization of open-source software. Although code summarization tools have been helpful for developers, most would prefer a detailed account of each parameter in a function rather than a high-level…

Software Engineering · Computer Science 2023-11-21 Vatsal Venkatkrishna, Durga Shree Nagabushanam, Emmanuel Iko-Ojo Simon, Melina Vidoni

DocPrompting: Generating Code by Retrieving the Docs

Publicly available source-code libraries are continuously growing and changing. This makes it impossible for models of code to keep current with all available APIs by simply training these models on existing code repositories. Thus,…

Computation and Language · Computer Science 2023-02-21 Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, Graham Neubig

Using Document Similarity Methods to create Parallel Datasets for Code Translation

Translating source code from one programming language to another is a critical, time-consuming task in modernizing legacy applications and codebases. Recent work in this space has drawn inspiration from the software naturalness hypothesis…

Computation and Language · Computer Science 2021-10-12 Mayank Agarwal, Kartik Talamadupula, Fernando Martinez, Stephanie Houde, Michael Muller, John Richards, Steven I Ross, Justin D. Weisz

A Syntactic Neural Model for General-Purpose Code Generation

We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without…

Computation and Language · Computer Science 2017-04-07 Pengcheng Yin, Graham Neubig

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly…

Machine Learning · Computer Science 2020-06-09 Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, Marc Brockschmidt

Auto-Documenation for Software Development

Software documentation is an essential but labor intensive task that often requires a dedicated team of developers to ensure coverage and accuracy. Good documentation will help shorten the development cycle and improve the overall team…

Software Engineering · Computer Science 2017-01-31 Thomas Zheng, Jeff Shaw, Sergey Kozlov

Generating Multilingual Parallel Corpus Using Subtitles

Neural Machine Translation, despite its significant results, still has a great problem: the lack or absence of parallel corpora for many languages. This article suggests a method for generating a considerable amount of parallel corpus for any language…

Computation and Language · Computer Science 2018-04-12 Farshad Jafari

Leveraging Code Generation to Improve Code Retrieval and Summarization via Dual Learning

Code summarization generates a brief natural language description given a source code snippet, while code retrieval fetches relevant source code given a natural language query. Since both tasks aim to model the association between natural…

Information Retrieval · Computer Science 2020-02-26 Wei Ye, Rui Xie, Jinglei Zhang, Tianxiang Hu, Xiaoyin Wang, Shikun Zhang

Multi-domain machine translation enhancements by parallel data extraction from comparable corpora

Parallel texts are a relatively rare language resource; however, they constitute very useful research material with a wide range of applications. This study presents and analyses new methodologies we developed for obtaining such data from…

Computation and Language · Computer Science 2016-03-23 Krzysztof Wołk, Emilia Rejmund, Krzysztof Marasek

On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot

Software engineering research has always been concerned with the improvement of code completion approaches, which suggest the next tokens a developer will likely type while coding. The release of GitHub Copilot constitutes a big step…

Software Engineering · Computer Science 2023-02-02 Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli, Simone Scalabrino, Rocco Oliveto, Gabriele Bavota

PyMT5: multi-mode translation of natural language and Python code with transformers

Simultaneously modeling source code and natural language has many exciting applications in automated software development and understanding. Pursuant to achieving such technology, we introduce PyMT5, the Python method text-to-text transfer…

Machine Learning · Computer Science 2020-10-08 Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, Neel Sundaresan

Natural Language-Guided Programming

In today's software world with its cornucopia of reusable software libraries, when a programmer is faced with a programming task that they suspect can be completed through the use of a library, they often look for code examples using a…

Software Engineering · Computer Science 2021-10-08 Geert Heyman, Rafael Huysegems, Pascal Justen, Tom Van Cutsem

CodeExp: Explanatory Code Document Generation

Developing models that can automatically generate detailed code explanation can greatly benefit software maintenance and programming education. However, existing code-to-text generation models often produce only high-level summaries of code…

Computation and Language · Computer Science 2022-11-29 Haotian Cui, Chenglong Wang, Junjie Huang, Jeevana Priya Inala, Todd Mytkowicz, Bo Wang, Jianfeng Gao, Nan Duan

Free and Customizable Code Documentation with LLMs: A Fine-Tuning Approach

Automated documentation of programming source code is a challenging task with significant practical and scientific implications for the developer community. We present a large language model (LLM)-based application that developers can use…

Software Engineering · Computer Science 2025-12-17 Sayak Chakrabarty, Souradip Pal

Data Augmentation for Code Translation with Comparable Corpora and Multiple References

One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code…

Computation and Language · Computer Science 2024-10-07 Yiqing Xie, Atharva Naik, Daniel Fried, Carolyn Rose

A Neural Model for Generating Natural Language Summaries of Program Subroutines

Source code summarization -- creating natural language descriptions of source code behavior -- is a rapidly-growing research topic with applications to automatic documentation generation, program comprehension, and software maintenance.…

Software Engineering · Computer Science 2019-02-07 Alexander LeClair, Siyuan Jiang, Collin McMillan

Simplifying Parallelization of Scientific Codes by a Function-Centric Approach in Python

The purpose of this paper is to show how existing scientific software can be parallelized using a separate thin layer of Python code where all parallel communication is implemented. We provide specific examples on such layers of code, and…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-05-18 Jon K. Nilsen, Xing Cai, Bjorn Hoyland, Hans Petter Langtangen

Automatic Code Generation using Pre-Trained Language Models

Recent advancements in natural language processing (GPT-2, BERT) have led to near-human performance in multiple natural language tasks. In this paper, we seek to understand whether similar techniques can be applied to a highly…

Computation and Language · Computer Science 2021-02-23 Luis Perez, Lizi Ottens, Sudharshan Viswanathan

Challenges in Data-to-Document Generation

Recent neural models have shown significant progress on the problem of generating short descriptive texts conditioned on a small number of database records. In this work, we suggest a slightly more difficult data-to-text generation task,…

Computation and Language · Computer Science 2017-07-26 Sam Wiseman, Stuart M. Shieber, Alexander M. Rush

Learning Semantic Correspondences in Technical Documentation

We consider the problem of translating high-level textual descriptions to formal representations in technical documentation as part of an effort to model the meaning of such documentation. We focus specifically on the problem of learning…

Computation and Language · Computer Science 2017-09-18 Kyle Richardson, Jonas Kuhn