English
Related papers

Related papers: GenCode: A Generic Data Augmentation Framework for…

200 papers

Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is…

Computation and Language · Computer Science 2023-05-11 Hung Quoc To , Nghi D. Q. Bui , Jin Guo , Tien N. Nguyen

Code generation aims to automatically generate code snippets of specific programming language according to natural language descriptions. The continuous advancements in deep learning, particularly pre-trained models, have empowered the code…

Software Engineering · Computer Science 2025-01-24 Zezhou Yang , Sirong Chen , Cuiyun Gao , Zhenhao Li , Xing Hu , Kui Liu , Xin Xia

Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their…

Computation and Language · Computer Science 2025-11-25 Haoze Wu , Yunzhi Yao , Wenhao Yu , Ningyu Zhang

Inspired by the great success of Deep Neural Networks (DNNs) in natural language processing (NLP), DNNs have been increasingly applied in source code analysis and attracted significant attention from the software engineering community. Due…

Software Engineering · Computer Science 2023-01-11 Zeming Dong , Qiang Hu , Yuejun Guo , Maxime Cordy , Mike Papadakis , Zhenya Zhang , Yves Le Traon , Jianjun Zhao

Training next-generation code generation models requires high-quality datasets, yet existing datasets face difficulty imbalance, format inconsistency, and data quality problems. We address these challenges through systematic data processing…

Computation and Language · Computer Science 2026-03-10 Zongqian Li , Tengchao Lv , Shaohan Huang , Yixuan Su , Qinzheng Sun , Qiufeng Yin , Ying Xin , Scarlett Li , Lei Cui , Nigel Collier , Furu Wei

Current coding benchmarks often inflate Large Language Model (LLM) capabilities due to static paradigms and data contamination, enabling models to exploit statistical shortcuts rather than genuine reasoning. To address this, we introduce…

Software Engineering · Computer Science 2026-02-17 Xinyue Zheng , Haowei Lin , Shaofei Cai , Zilong Zheng , Yaodong Yang , Yitao Liang

Recent studies have demonstrated remarkable advancements in source code learning, which applies deep neural networks (DNNs) to tackle various software engineering tasks. Similar to other DNN-based domains, source code learning also requires…

Software Engineering · Computer Science 2025-02-07 Zeming Dong , Qiang Hu , Yuejun Guo , Zhenya Zhang , Maxime Cordy , Mike Papadakis , Yves Le Traon , Jianjun Zhao

Recent advances in large language models (LLMs) have demonstrated impressive capabilities in code-related tasks, such as code generation and automated program repair. Despite their promising performance, most existing approaches for code…

Software Engineering · Computer Science 2025-09-03 Yicong Zhao , Shisong Chen , Jiacheng Zhang , Zhixu Li

Data augmentation is widely used to enhance generalization in visual classification tasks. However, traditional methods struggle when source and target domains differ, as in domain adaptation, due to their inability to address domain gaps.…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Khawar Islam , Muhammad Zaigham Zaheer , Arif Mahmood , Karthik Nandakumar , Naveed Akhtar

Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale…

Computation and Language · Computer Science 2025-02-18 Yichuan Ma , Yunfan Shao , Peiji Li , Demin Song , Qipeng Guo , Linyang Li , Xipeng Qiu , Kai Chen

Generative recommendation plays a crucial role in personalized systems, predicting users' future interactions from their historical behavior sequences. A critical yet underexplored factor in training these models is data augmentation, the…

Machine Learning · Computer Science 2026-05-21 Geon Lee , Bhuvesh Kumar , Clark Mingxuan Ju , Tong Zhao , Kijung Shin , Neil Shah , Liam Collins

The increasingly popular adoption of deep learning models in many critical source code tasks motivates the development of data augmentation (DA) techniques to enhance training data and improve various capabilities (e.g., robustness and…

Computation and Language · Computer Science 2023-11-14 Terry Yue Zhuo , Zhou Yang , Zhensu Sun , Yufei Wang , Li Li , Xiaoning Du , Zhenchang Xing , David Lo

As text and code resources have expanded, large-scale pre-trained models have shown promising capabilities in code generation tasks, typically employing supervised fine-tuning with problem statement-program pairs. However, increasing model…

Computation and Language · Computer Science 2025-04-10 Nathanaël Beau , Benoît Crabbé

Large language models (LLMs) have recently shown impressive results on diverse code-related tasks, benefiting from large-scale training and instruction tuning. However, studies reveal that their grasp of fundamental programming concepts,…

Software Engineering · Computer Science 2025-08-19 Xiaoning Ren , Qiang Hu , Wei Ma , Yan Li , Yao Zhang , Lingxiao Jiang , Yinxing Xue

Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools, allowing them to access a wider range of information. While recent LLMs are typically fine-tuned with tool usage examples during…

Computation and Language · Computer Science 2025-02-27 Jie He , Jennifer Neville , Mengting Wan , Longqi Yang , Hui Liu , Xiaofeng Xu , Xia Song , Jeff Z. Pan , Pei Zhou

Language models can serve as a valuable tool for software developers to increase productivity. Large generative models can be used for code generation and code completion, while smaller encoder-only models are capable of performing code…

Computation and Language · Computer Science 2023-11-17 Andor Diera , Abdelhalim Dahou , Lukas Galke , Fabian Karl , Florian Sihler , Ansgar Scherp

Data augmentations are useful in closing the sim-to-real domain gap when training on synthetic data. This is because they widen the training data distribution, thus encouraging the model to generalize better to other domains. Many image…

Computer Vision and Pattern Recognition · Computer Science 2024-03-12 Bram Vanherle , Nick Michiels , Frank Van Reeth

The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes…

Computation and Language · Computer Science 2025-05-30 Wenhao Hu , Jinhao Duan , Chunchen Wei , Li Zhang , Yue Zhang , Kaidi Xu

Training effective Generative Adversarial Networks (GANs) requires large amounts of training data, without which the trained models are usually sub-optimal with discriminator over-fitting. Several prior studies address this issue by…

Computer Vision and Pattern Recognition · Computer Science 2021-12-07 Kaiwen Cui , Jiaxing Huang , Zhipeng Luo , Gongjie Zhang , Fangneng Zhan , Shijian Lu

Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) the complexity of writing optimized code (such as performant CUDA kernels and competition-level…

Machine Learning · Computer Science 2026-01-12 Jiefu Ou , Sapana Chaudhary , Kaj Bostrom , Nathaniel Weir , Shuai Zhang , Huzefa Rangwala , George Karypis
‹ Prev 1 2 3 10 Next ›