Related papers: GenCode: A Generic Data Augmentation Framework for…

Better Language Models of Code through Self-Improvement

Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is…

Computation and Language · Computer Science 2023-05-11 Hung Quoc To, Nghi D. Q. Bui, Jin Guo, Tien N. Nguyen

An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities

Code generation aims to automatically generate code snippets in a specific programming language according to natural language descriptions. The continuous advancements in deep learning, particularly pre-trained models, have empowered the code…

Software Engineering · Computer Science 2025-01-24 Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, Xin Xia

ReCode: Updating Code API Knowledge with Reinforcement Learning

Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their…

Computation and Language · Computer Science 2025-11-25 Haoze Wu, Yunzhi Yao, Wenhao Yu, Ningyu Zhang

MIXCODE: Enhancing Code Classification by Mixup-Based Data Augmentation

Inspired by the great success of Deep Neural Networks (DNNs) in natural language processing (NLP), DNNs have been increasingly applied in source code analysis and attracted significant attention from the software engineering community. Due…

Software Engineering · Computer Science 2023-01-11 Zeming Dong, Qiang Hu, Yuejun Guo, Maxime Cordy, Mike Papadakis, Zhenya Zhang, Yves Le Traon, Jianjun Zhao

Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems

Training next-generation code generation models requires high-quality datasets, yet existing datasets face difficulty imbalance, format inconsistency, and data quality problems. We address these challenges through systematic data processing…

Computation and Language · Computer Science 2026-03-10 Zongqian Li, Tengchao Lv, Shaohan Huang, Yixuan Su, Qinzheng Sun, Qiufeng Yin, Ying Xin, Scarlett Li, Lei Cui, Nigel Collier, Furu Wei

UniCode: Augmenting Evaluation for Code Reasoning

Current coding benchmarks often inflate Large Language Model (LLM) capabilities due to static paradigms and data contamination, enabling models to exploit statistical shortcuts rather than genuine reasoning. To address this, we introduce…

Software Engineering · Computer Science 2026-02-17 Xinyue Zheng, Haowei Lin, Shaofei Cai, Zilong Zheng, Yaodong Yang, Yitao Liang

Boosting Source Code Learning with Text-Oriented Data Augmentation: An Empirical Study

Recent studies have demonstrated remarkable advancements in source code learning, which applies deep neural networks (DNNs) to tackle various software engineering tasks. Similar to other DNN-based domains, source code learning also requires…

Software Engineering · Computer Science 2025-02-07 Zeming Dong, Qiang Hu, Yuejun Guo, Zhenya Zhang, Maxime Cordy, Mike Papadakis, Yves Le Traon, Jianjun Zhao

ReCode: Improving LLM-based Code Repair with Fine-Grained Retrieval-Augmented Generation

Recent advances in large language models (LLMs) have demonstrated impressive capabilities in code-related tasks, such as code generation and automated program repair. Despite their promising performance, most existing approaches for code…

Software Engineering · Computer Science 2025-09-03 Yicong Zhao, Shisong Chen, Jiacheng Zhang, Zhixu Li

GenMix: Effective Data Augmentation with Generative Diffusion Model Image Editing

Data augmentation is widely used to enhance generalization in visual classification tasks. However, traditional methods struggle when source and target domains differ, as in domain adaptation, due to their inability to address domain gaps.…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Khawar Islam, Muhammad Zaigham Zaheer, Arif Mahmood, Karthik Nandakumar, Naveed Akhtar

UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance

Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale…

Computation and Language · Computer Science 2025-02-18 Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen

Sequential Data Augmentation for Generative Recommendation

Generative recommendation plays a crucial role in personalized systems, predicting users' future interactions from their historical behavior sequences. A critical yet underexplored factor in training these models is data augmentation, the…

Machine Learning · Computer Science 2026-05-21 Geon Lee, Bhuvesh Kumar, Clark Mingxuan Ju, Tong Zhao, Kijung Shin, Neil Shah, Liam Collins

Source Code Data Augmentation for Deep Learning: A Survey

The increasingly popular adoption of deep learning models in many critical source code tasks motivates the development of data augmentation (DA) techniques to enhance training data and improve various capabilities (e.g., robustness and…

Computation and Language · Computer Science 2023-11-14 Terry Yue Zhuo, Zhou Yang, Zhensu Sun, Yufei Wang, Li Li, Xiaoning Du, Zhenchang Xing, David Lo

RETROcode: Leveraging a Code Database for Improved Natural Language to Code Generation

As text and code resources have expanded, large-scale pre-trained models have shown promising capabilities in code generation tasks, typically employing supervised fine-tuning with problem statement-program pairs. However, increasing model…

Computation and Language · Computer Science 2025-04-10 Nathanaël Beau, Benoît Crabbé

Strengthening Programming Comprehension in Large Language Models through Code Generation

Large language models (LLMs) have recently shown impressive results on diverse code-related tasks, benefiting from large-scale training and instruction tuning. However, studies reveal that their grasp of fundamental programming concepts,…

Software Engineering · Computer Science 2025-08-19 Xiaoning Ren, Qiang Hu, Wei Ma, Yan Li, Yao Zhang, Lingxiao Jiang, Yinxing Xue

GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation

Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools, allowing them to access a wider range of information. While recent LLMs are typically fine-tuned with tool usage examples during…

Computation and Language · Computer Science 2025-02-27 Jie He, Jennifer Neville, Mengting Wan, Longqi Yang, Hui Liu, Xiaofeng Xu, Xia Song, Jeff Z. Pan, Pei Zhou

GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding

Language models can serve as a valuable tool for software developers to increase productivity. Large generative models can be used for code generation and code completion, while smaller encoder-only models are capable of performing code…

Computation and Language · Computer Science 2023-11-17 Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Florian Sihler, Ansgar Scherp

Genetic Learning for Designing Sim-to-Real Data Augmentations

Data augmentations are useful in closing the sim-to-real domain gap when training on synthetic data. This is because they widen the training data distribution, thus encouraging the model to generalize better to other domains. Many image…

Computer Vision and Pattern Recognition · Computer Science 2024-03-12 Bram Vanherle, Nick Michiels, Frank Van Reeth

DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation

The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes…

Computation and Language · Computer Science 2025-05-30 Wenhao Hu, Jinhao Duan, Chunchen Wei, Li Zhang, Yue Zhang, Kaidi Xu

GenCo: Generative Co-training for Generative Adversarial Networks with Limited Data

Training effective Generative Adversarial Networks (GANs) requires large amounts of training data, without which the trained models are usually sub-optimal with discriminator over-fitting. Several prior studies address this issue by…

Computer Vision and Pattern Recognition · Computer Science 2021-12-07 Kaiwen Cui, Jiaxing Huang, Zhipeng Luo, Gongjie Zhang, Fangneng Zhan, Shijian Lu

MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization

Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) the complexity of writing optimized code (such as performant CUDA kernels and competition-level…

Machine Learning · Computer Science 2026-01-12 Jiefu Ou, Sapana Chaudhary, Kaj Bostrom, Nathaniel Weir, Shuai Zhang, Huzefa Rangwala, George Karypis