Related papers: Using Document Similarity Methods to create Parall…

Leveraging Automated Unit Tests for Unsupervised Code Translation

With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method…

Software Engineering · Computer Science 2022-02-17 Baptiste Roziere, Jie M. Zhang, Francois Charton, Mark Harman, Gabriel Synnaeve, Guillaume Lample

CoDesc: A Large Code-Description Parallel Dataset

Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the…

Computation and Language · Computer Science 2021-06-01 Masum Hasan, Tanveer Muttaqueen, Abdullah Al Ishtiaq, Kazi Sajeed Mehrab, Md. Mahim Anjum Haque, Tahmid Hasan, Wasi Uddin Ahmad, Anindya Iqbal, Rifat Shahriyar

Code-Switched Language Models Using Neural Based Synthetic Data from Parallel Sentences

Training code-switched language models is difficult due to lack of data and complexity in the grammatical structure. Linguistic constraint theories have been used for decades to generate artificial code-switching sentences to cope with this…

Computation and Language · Computer Science 2019-09-19 Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, Pascale Fung

How to Learn in a Noisy World? Self-Correcting the Real-World Data Noise in Machine Translation

The massive amounts of web-mined parallel data contain large amounts of noise. Semantic misalignment, as the primary source of the noise, poses a challenge for training machine translation systems. In this paper, we first introduce a…

Computation and Language · Computer Science 2025-02-10 Yan Meng, Di Wu, Christof Monz

Constructing Multilingual Code Search Dataset Using Neural Machine Translation

Code search is a task to find programming codes that semantically match the given natural language queries. Even though some of the existing datasets for this task are multilingual on the programming language side, their query data are only…

Computation and Language · Computer Science 2023-06-28 Ryo Sekizawa, Nan Duan, Shuai Lu, Hitomi Yanaka

Data Augmentation for Code Translation with Comparable Corpora and Multiple References

One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code…

Computation and Language · Computer Science 2024-10-07 Yiqing Xie, Atharva Naik, Daniel Fried, Carolyn Rose

Neural Machine Translation for Code Generation

Neural machine translation (NMT) methods developed for natural language processing have been shown to be highly successful in automating translation from one natural language to another. Recently, these NMT methods have been adapted to the…

Computation and Language · Computer Science 2023-05-24 Dharma KC, Clayton T. Morrison

Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data

The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages;…

Computation and Language · Computer Science 2021-06-03 Wei-Jen Ko, Ahmed El-Kishky, Adithya Renduchintala, Vishrav Chaudhary, Naman Goyal, Francisco Guzmán, Pascale Fung, Philipp Koehn, Mona Diab

A parallel corpus of Python functions and documentation strings for automated code documentation and code generation

Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of…

Computation and Language · Computer Science 2017-07-10 Antonio Valerio Miceli Barone, Rico Sennrich

Unsupervised Translation of Programming Languages

A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port…

Computation and Language · Computer Science 2020-09-23 Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, Guillaume Lample

A Neural Model for Generating Natural Language Summaries of Program Subroutines

Source code summarization -- creating natural language descriptions of source code behavior -- is a rapidly-growing research topic with applications to automatic documentation generation, program comprehension, and software maintenance.…

Software Engineering · Computer Science 2019-02-07 Alexander LeClair, Siyuan Jiang, Collin McMillan

Quality Estimation & Interpretability for Code Translation

Recently, the automated translation of source code from one programming language to another by using automatic approaches inspired by Neural Machine Translation (NMT) methods for natural languages has come under study. However, such…

Software Engineering · Computer Science 2021-04-28 Mayank Agarwal, Kartik Talamadupula, Stephanie Houde, Fernando Martinez, Michael Muller, John Richards, Steven Ross, Justin D. Weisz

Program Translation via Code Distillation

Software version migration and program translation are an important and costly part of the lifecycle of large codebases. Traditional machine translation relies on parallel corpora for supervised translation, which is not feasible for…

Software Engineering · Computer Science 2023-10-19 Yufan Huang, Mengnan Qi, Yongqiang Yao, Maoquan Wang, Bin Gu, Colin Clement, Neel Sundaresan

Multi-domain machine translation enhancements by parallel data extraction from comparable corpora

Parallel texts are a relatively rare language resource; however, they constitute a very useful research material with a wide range of applications. This study presents and analyses new methodologies we developed for obtaining such data from…

Computation and Language · Computer Science 2016-03-23 Krzysztof Wołk, Emilia Rejmund, Krzysztof Marasek

Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages

Back-translation is widely known for its effectiveness in neural machine translation when there is little to no parallel data. In this approach, a source-to-target model is coupled with a target-to-source model trained in parallel. The…

Computation and Language · Computer Science 2023-02-14 Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, Kai-Wei Chang

Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora

Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This is because there could be noise in the source code corpora used…

Software Engineering · Computer Science 2022-10-04 Anh T. V. Dau, Thang Nguyen-Duc, Hoang Thanh-Tung, Nghi D. Q. Bui

Translating away Translationese without Parallel Data

Translated texts exhibit systematic linguistic differences compared to original texts in the same language, and these differences are referred to as translationese. Translationese has effects on various cross-lingual natural language…

Computation and Language · Computer Science 2023-10-31 Rricha Jalota, Koel Dutta Chowdhury, Cristina España-Bonet, Josef van Genabith

Source code summarization involves creating brief descriptions of source code in natural language. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of…

Software Engineering · Computer Science 2022-04-05 Sakib Haque, Zachary Eberhart, Aakash Bansal, Collin McMillan

A Multi-Perspective Architecture for Semantic Code Search

The ability to match pieces of code to their corresponding natural language descriptions and vice versa is fundamental for natural language search interfaces to software repositories. In this paper, we propose a novel multi-perspective…

Software Engineering · Computer Science 2024-04-12 Rajarshi Haldar, Lingfei Wu, Jinjun Xiong, Julia Hockenmaier

Building a Neural Machine Translation System Using Only Synthetic Parallel Data

Recent works have shown that synthetic parallel data automatically generated by translation models can be effective for various neural machine translation (NMT) issues. In this study, we build NMT systems using only synthetic parallel data.…

Computation and Language · Computer Science 2017-09-19 Jaehong Park, Jongyoon Song, Sungroh Yoon