Related papers: Data Augmentation for Code Translation with Compar…

Automated Snippet-Alignment Data Augmentation for Code Translation

Code translation aims to translate the code from its source language to the target language and is used in various software development scenarios. Recent developments in Large Language Models (LLMs) have showcased their capabilities in code…

Software Engineering · Computer Science 2025-10-20 Zhiming Zhang, Qingfu Zhu, Xianzhen Luo, Yixuan Wang, Bohan Li, Wanxiang Che

Data Augmentation for Neural Machine Translation using Generative Language Model

Despite the rapid growth in model architecture, the scarcity of large parallel corpora remains the main bottleneck in Neural Machine Translation. Data augmentation is a technique that enhances the performance of data-hungry models by…

Computation and Language · Computer Science 2023-11-14 Seokjin Oh, Su Ah Lee, Woohwan Jung

Multi-Source Neural Machine Translation with Data Augmentation

Multi-source translation systems translate from multiple languages to a single target language. By using information from these multiple sources, these systems achieve large gains in accuracy. To train these systems, it is necessary to have…

Computation and Language · Computer Science 2018-11-09 Yuta Nishimura, Katsuhito Sudoh, Graham Neubig, Satoshi Nakamura

Exploring Data Augmentation for Code Generation Tasks

Advances in natural language processing, such as transfer learning from pre-trained language models, have impacted how models are trained for programming language tasks too. Previous research primarily explored code pre-training and…

Computation and Language · Computer Science 2023-02-08 Pinzhen Chen, Gerasimos Lampouras

Using Document Similarity Methods to create Parallel Datasets for Code Translation

Translating source code from one programming language to another is a critical, time-consuming task in modernizing legacy applications and codebases. Recent work in this space has drawn inspiration from the software naturalness hypothesis…

Computation and Language · Computer Science 2021-10-12 Mayank Agarwal, Kartik Talamadupula, Fernando Martinez, Stephanie Houde, Michael Muller, John Richards, Steven I Ross, Justin D. Weisz

Learn to Code-Switch: Data Augmentation using Copy Mechanism on Language Modeling

Building large-scale datasets for training code-switching language models is challenging and very expensive. To alleviate this problem using parallel corpus has been a major workaround. However, existing solutions use linguistic constraints…

Computation and Language · Computer Science 2018-10-31 Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, Pascale Fung

Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study

Code-switching (CSW) text generation has been receiving increasing attention as a solution to address data scarcity. In light of this growing interest, we need more comprehensive studies comparing different augmentation approaches. In this…

Computation and Language · Computer Science 2023-10-25 Injy Hamed, Nizar Habash, Ngoc Thang Vu

Neural Machine Translation Data Generation and Augmentation using ChatGPT

Neural models have revolutionized the field of machine translation, but creating parallel corpora is expensive and time-consuming. We investigate an alternative to manual parallel corpora - hallucinated parallel corpora created by…

Computation and Language · Computer Science 2023-07-13 Wayne Yang, Garrett Nicolai

Multi-domain machine translation enhancements by parallel data extraction from comparable corpora

Parallel texts are a relatively rare language resource, however, they constitute a very useful research material with a wide range of applications. This study presents and analyses new methodologies we developed for obtaining such data from…

Computation and Language · Computer Science 2016-03-23 Krzysztof Wołk, Emilia Rejmund, Krzysztof Marasek

Leveraging Automated Unit Tests for Unsupervised Code Translation

With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method…

Software Engineering · Computer Science 2022-02-17 Baptiste Roziere, Jie M. Zhang, Francois Charton, Mark Harman, Gabriel Synnaeve, Guillaume Lample

Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents

The multilingual nature of the world makes translation a crucial requirement today. Parallel dictionaries constructed by humans are a widely-available resource, but they are limited and do not provide enough coverage for good quality…

Computation and Language · Computer Science 2015-12-08 Krzysztof Wołk, Krzysztof Marasek

Parallel Data Augmentation for Formality Style Transfer

The main barrier to progress in the task of Formality Style Transfer is the inadequacy of training data. In this paper, we study how to augment parallel data and propose novel and simple data augmentation methods for this task to obtain…

Computation and Language · Computer Science 2020-05-18 Yi Zhang, Tao Ge, Xu Sun

Data Augmentation and Hyperparameter Tuning for Low-Resource MFA

A continued issue for those working with computational tools and endangered and under-resourced languages is the lower accuracy of results for languages with smaller amounts of data. We attempt to ameliorate this issue by using data…

Computation and Language · Computer Science 2025-04-10 Alessio Tosolini, Claire Bowern

Beyond Code Pairs: Dialogue-Based Data Generation for LLM Code Translation

Large language models (LLMs) have shown remarkable capabilities in code translation, yet their performance deteriorates in low-resource programming domains such as Fortran and emerging frameworks like CUDA, where high-quality parallel data…

Programming Languages · Computer Science 2025-12-04 Le Chen, Nuo Xu, Winson Chen, Bin Lei, Pei-Hung Lin, Dunzhi Zhou, Rajeev Thakur, Caiwen Ding, Ali Jannesari, Chunhua Liao

Original or Translated? On the Use of Parallel Data for Translation Quality Estimation

Machine Translation Quality Estimation (QE) is the task of evaluating translation output in the absence of human-written references. Due to the scarcity of human-labeled QE data, previous works attempted to utilize the abundant unlabeled…

Computation and Language · Computer Science 2022-12-21 Baopu Qiu, Liang Ding, Di Wu, Lin Shang, Yibing Zhan, Dacheng Tao

SITTA: Single Image Texture Translation for Data Augmentation

Recent advances in data augmentation enable one to translate images by learning the mapping between a source domain and a target domain. Existing methods tend to learn the distributions by training a model on a variety of datasets, with…

Computer Vision and Pattern Recognition · Computer Science 2023-01-18 Boyi Li, Yin Cui, Tsung-Yi Lin, Serge Belongie

Program Translation via Code Distillation

Software version migration and program translation are an important and costly part of the lifecycle of large codebases. Traditional machine translation relies on parallel corpora for supervised translation, which is not feasible for…

Software Engineering · Computer Science 2023-10-19 Yufan Huang, Mengnan Qi, Yongqiang Yao, Maoquan Wang, Bin Gu, Colin Clement, Neel Sundaresan

Sentence Concatenation Approach to Data Augmentation for Neural Machine Translation

Neural machine translation (NMT) has recently gained widespread attention because of its high translation accuracy. However, it shows poor performance in the translation of long sentences, which is a major issue in low-resource languages.…

Computation and Language · Computer Science 2021-04-20 Seiichiro Kondo, Kengo Hotate, Masahiro Kaneko, Mamoru Komachi

Rethinking Data Augmentation for Low-Resource Neural Machine Translation: A Multi-Task Learning Approach

In the context of neural machine translation, data augmentation (DA) techniques may be used for generating additional training samples when the available parallel data are scarce. Many DA approaches aim at expanding the support of the…

Computation and Language · Computer Science 2021-09-09 Víctor M. Sánchez-Cartagena, Miquel Esplà-Gomis, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez

CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding

Data augmentation has been demonstrated as an effective strategy for improving model generalization and data efficiency. However, due to the discrete nature of natural language, designing label-preserving transformations for text data tends…

Computation and Language · Computer Science 2020-10-20 Yanru Qu, Dinghan Shen, Yelong Shen, Sandra Sajeev, Jiawei Han, Weizhu Chen