Related papers: Exploring Data Augmentation for Code Generation Ta…

Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities

The increasing size and complexity of pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training sets could…

Computation and Language · Computer Science 2025-02-03 Yaping Chai, Haoran Xie, Joe S. Qin

Empowering Large Language Models for Textual Data Augmentation

With the capabilities of understanding and executing natural language instructions, large language models (LLMs) can potentially act as a powerful tool for textual data augmentation. However, the quality of augmented data depends heavily on…

Computation and Language · Computer Science 2024-04-30 Yichuan Li, Kaize Ding, Jianling Wang, Kyumin Lee

Boosting Source Code Learning with Text-Oriented Data Augmentation: An Empirical Study

Recent studies have demonstrated remarkable advancements in source code learning, which applies deep neural networks (DNNs) to tackle various software engineering tasks. Similar to other DNN-based domains, source code learning also requires…

Software Engineering · Computer Science 2025-02-07 Zeming Dong, Qiang Hu, Yuejun Guo, Zhenya Zhang, Maxime Cordy, Mike Papadakis, Yves Le Traon, Jianjun Zhao

Leveraging Data Augmentation for Process Information Extraction

Business Process Modeling projects often require formal process models as a central component. High costs associated with the creation of such formal process models motivated many different fields of research aimed at automated generation…

Computation and Language · Computer Science 2024-04-12 Julian Neuberger, Leonie Doll, Benedict Engelmann, Lars Ackermann, Stefan Jablonski

Code Needs Comments: Enhancing Code LLMs with Comment Augmentation

The programming skill is one crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data…

Computation and Language · Computer Science 2024-02-21 Demin Song, Honglin Guo, Yunhua Zhou, Shuhao Xing, Yudong Wang, Zifan Song, Wenwei Zhang, Qipeng Guo, Hang Yan, Xipeng Qiu, Dahua Lin

How Does Code Pretraining Affect Language Model Task Performance?

Large language models are increasingly trained on corpora containing both natural language and non-linguistic data like source code. Aside from aiding programming-related tasks, anecdotal evidence suggests that including code in pretraining…

Computation and Language · Computer Science 2025-02-26 Jackson Petty, Sjoerd van Steenkiste, Tal Linzen

A Survey on Data Augmentation for Text Classification

Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing a model's generalization…

Computation and Language · Computer Science 2022-09-09 Markus Bayer, Marc-André Kaufhold, Christian Reuter

Data Augmentation for Text Generation Without Any Augmented Data

Data augmentation is an effective way to improve the performance of many neural text generation models. However, current data augmentation methods need to define or choose proper data mapping functions that map the original samples into the…

Computation and Language · Computer Science 2021-05-31 Wei Bi, Huayang Li, Jiacheng Huang

Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning

Fine-tuning pre-trained generative language models to down-stream language generation tasks has shown promising results. However, this comes with the cost of having a single, large model for each task, which is not ideal in low-memory/power…

Computation and Language · Computer Science 2020-09-22 Zhaojiang Lin, Andrea Madotto, Pascale Fung

Data Augmentation in Natural Language Processing: A Novel Text Generation Approach for Long and Short Text Classifiers

In many cases of machine learning, research suggests that the development of training data might have a higher relevance than the choice and modelling of classifiers themselves. Thus, data augmentation methods have been developed to improve…

Computation and Language · Computer Science 2022-07-25 Markus Bayer, Marc-André Kaufhold, Björn Buchhold, Marcel Keller, Jörg Dallmeyer, Christian Reuter

Rethink the Effectiveness of Text Data Augmentation: An Empirical Analysis

In recent years, language models (LMs) have made remarkable progress in advancing the field of natural language processing (NLP). However, the impact of data augmentation (DA) techniques on the fine-tuning (FT) performance of these LMs has…

Computation and Language · Computer Science 2023-06-14 Zhengxiang Shi, Aldo Lipani

Rethinking Data Augmentation for Low-Resource Neural Machine Translation: A Multi-Task Learning Approach

In the context of neural machine translation, data augmentation (DA) techniques may be used for generating additional training samples when the available parallel data are scarce. Many DA approaches aim at expanding the support of the…

Computation and Language · Computer Science 2021-09-09 Víctor M. Sánchez-Cartagena, Miquel Esplà-Gomis, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez

Data Augmentation and Hyperparameter Tuning for Low-Resource MFA

A continued issue for those working with computational tools and endangered and under-resourced languages is the lower accuracy of results for languages with smaller amounts of data. We attempt to ameliorate this issue by using data…

Computation and Language · Computer Science 2025-04-10 Alessio Tosolini, Claire Bowern

Make More of Your Data: Minimal Effort Data Augmentation for Automatic Speech Recognition and Translation

Data augmentation is a technique to generate new training data based on existing data. We evaluate the simple and cost-effective method of concatenating the original data examples to build new training instances. Continued training with…

Computation and Language · Computer Science 2023-06-12 Tsz Kin Lam, Shigehiko Schamoni, Stefan Riezler

Data Augmentation for Code Translation with Comparable Corpora and Multiple References

One major challenge of translating code between programming languages is that parallel training data is often limited. To overcome this challenge, we present two data augmentation techniques, one that builds comparable corpora (i.e., code…

Computation and Language · Computer Science 2024-10-07 Yiqing Xie, Atharva Naik, Daniel Fried, Carolyn Rose

Code-Switching for Enhancing NMT with Pre-Specified Translation

Leveraging user-provided translation to constrain NMT has practical significance. Existing methods can be classified into two main categories, namely the use of placeholder tags for lexicon words and the use of hard constraints during…

Computation and Language · Computer Science 2019-05-17 Kai Song, Yue Zhang, Heng Yu, Weihua Luo, Kun Wang, Min Zhang

A Survey on Data Augmentation in Large Model Era

Large models, encompassing large language and diffusion models, have shown exceptional promise in approximating human-level intelligence, garnering significant interest from both academic and industrial spheres. However, the training of…

Machine Learning · Computer Science 2024-03-05 Yue Zhou, Chenlu Guo, Xu Wang, Yi Chang, Yuan Wu

Training Bilingual LMs with Data Constraints in the Targeted Language

Large language models are trained on massive scrapes of the web, as required by current scaling laws. Most progress is made for English, given its abundance of high-quality pretraining data. For most other languages, however, such high…

Computation and Language · Computer Science 2025-02-07 Skyler Seto, Maartje ter Hoeve, Richard He Bai, Natalie Schluter, David Grangier

DAGAM: Data Augmentation with Generation And Modification

Text classification is a representative downstream task of natural language processing, and has exhibited excellent performance since the advent of pre-trained language models based on Transformer architecture. However, in pre-trained…

Computation and Language · Computer Science 2022-04-07 Byeong-Cheol Jo, Tak-Sung Heo, Yeongjoon Park, Yongmin Yoo, Won Ik Cho, Kyungsun Kim

Back Translation Survey for Improving Text Augmentation

Natural Language Processing (NLP) relies heavily on training data. Transformers, as they have gotten bigger, have required massive amounts of training data. To satisfy this requirement, text augmentation should be looked at as a way to…

Computation and Language · Computer Science 2022-11-17 Matthew Ciolino, David Noever, Josh Kalin