Related papers: Incorporating Domain Knowledge through Task Augmen…

An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities

Code generation aims to automatically generate code snippets in a specific programming language according to natural language descriptions. The continuous advancements in deep learning, particularly pre-trained models, have empowered the code…

Software Engineering · Computer Science 2025-01-24 Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, Xin Xia

Improving Tree-Structured Decoder Training for Code Generation via Mutual Learning

Code generation aims to automatically generate a piece of code given an input natural language utterance. Currently, among dominant models, it is treated as a sequence-to-tree task, where a decoder outputs a sequence of actions…

Artificial Intelligence · Computer Science 2021-06-01 Binbin Xie, Jinsong Su, Yubin Ge, Xiang Li, Jianwei Cui, Junfeng Yao, Bin Wang

Generation-Augmented Query Expansion For Code Retrieval

Pre-trained language models have achieved promising success in code retrieval tasks, where a natural language documentation query is given to find the most relevant existing code snippet. However, existing models focus only on optimizing…

Software Engineering · Computer Science 2022-12-22 Dong Li, Yelong Shen, Ruoming Jin, Yi Mao, Kuan Wang, Weizhu Chen

Domain-Specific Text Generation for Machine Translation

Preservation of domain knowledge from the source to target is crucial in any translation workflow. It is common in the translation industry to receive highly specialized projects, where there is hardly any parallel in-domain data. In such…

Computation and Language · Computer Science 2022-09-15 Yasmin Moslem, Rejwanul Haque, John D. Kelleher, Andy Way

Exploring Data Augmentation for Code Generation Tasks

Advances in natural language processing, such as transfer learning from pre-trained language models, have impacted how models are trained for programming language tasks too. Previous research primarily explored code pre-training and…

Computation and Language · Computer Science 2023-02-08 Pinzhen Chen, Gerasimos Lampouras

GenX: Mastering Code and Test Generation with Execution Feedback

Recent advancements in language modeling have enabled the translation of natural language into code, and the use of execution feedback to improve code generation. However, these methods often rely heavily on pre-existing test cases, which…

Software Engineering · Computer Science 2024-12-19 Nan Wang, Yafei Liu, Chen Chen, Haonan Lu

GenCode: A Generic Data Augmentation Framework for Boosting Deep Learning-Based Code Understanding

Pre-trained code models lead the era of code intelligence, with multiple models designed with impressive performance. However, one important problem, data augmentation for code data that automatically helps developers prepare training data…

Software Engineering · Computer Science 2026-01-29 Zeming Dong, Qiang Hu, Xiaofei Xie, Maxime Cordy, Mike Papadakis, Yves Le Traon, Jianjun Zhao

Boosting Source Code Learning with Text-Oriented Data Augmentation: An Empirical Study

Recent studies have demonstrated remarkable advancements in source code learning, which applies deep neural networks (DNNs) to tackle various software engineering tasks. Similar to other DNN-based domains, source code learning also requires…

Software Engineering · Computer Science 2025-02-07 Zeming Dong, Qiang Hu, Yuejun Guo, Zhenya Zhang, Maxime Cordy, Mike Papadakis, Yves Le Traon, Jianjun Zhao

CodeKGC: Code Language Model for Generative Knowledge Graph Construction

Current generative knowledge graph construction approaches usually fail to capture structural knowledge by simply flattening natural language into serialized texts or a specification language. However, large generative language model…

Computation and Language · Computer Science 2024-01-19 Zhen Bi, Jing Chen, Yinuo Jiang, Feiyu Xiong, Wei Guo, Huajun Chen, Ningyu Zhang

Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study

Transformer-based pre-trained models have recently achieved great results in solving many software engineering tasks, including automatic code completion, which is a staple in a developer's toolkit. While many have striven to improve the…

Computation and Language · Computer Science 2023-04-25 Tim van Dam, Maliheh Izadi, Arie van Deursen

Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization

Automatically generating human-readable text describing the functionality of a program is the intent of source code summarization. Although neural language models achieve significant performance in this field, they are limited by their…

Artificial Intelligence · Computer Science 2024-04-02 Tong Ye, Lingfei Wu, Tengfei Ma, Xuhong Zhang, Yangkai Du, Peiyu Liu, Shouling Ji, Wenhai Wang

Top General Performance = Top Domain Performance? DomainCodeBench: A Multi-domain Code Generation Benchmark

With the rapid advancement of large language models (LLMs), extensive research has been conducted to investigate the code generation capabilities of LLMs. However, existing efforts primarily focus on general-domain tasks, leaving LLMs' code…

Software Engineering · Computer Science 2025-03-18 Dewu Zheng, Yanlin Wang, Ensheng Shi, Xilin Liu, Yuchi Ma, Hongyu Zhang, Zibin Zheng

Code Search based on Context-aware Code Translation

Code search is a widely used technique by developers during software development. It provides semantically similar implementations from a large code corpus to developers based on their queries. Existing techniques leverage deep learning…

Software Engineering · Computer Science 2022-02-17 Weisong Sun, Chunrong Fang, Yuchen Chen, Guanhong Tao, Tingxu Han, Quanjun Zhang

Learning to Compose Domain-Specific Transformations for Data Augmentation

Data augmentation is a ubiquitous technique for increasing the size of labeled training sets by leveraging task-specific data transformations that preserve class labels. While it is often easy for domain experts to specify individual…

Machine Learning · Statistics 2018-12-10 Alexander J. Ratner, Henry R. Ehrenberg, Zeshan Hussain, Jared Dunnmon, Christopher Ré

Better Language Models of Code through Self-Improvement

Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is…

Computation and Language · Computer Science 2023-05-11 Hung Quoc To, Nghi D. Q. Bui, Jin Guo, Tien N. Nguyen

Multi-task Learning based Pre-trained Language Model for Code Completion

Code completion is one of the most useful features in the Integrated Development Environments (IDEs), which can accelerate software development by suggesting the next probable token based on the contextual code in real-time. Recent studies…

Software Engineering · Computer Science 2021-01-01 Fang Liu, Ge Li, Yunfei Zhao, Zhi Jin

Domain Curricula for Code-Switched MT at MixMT 2022

In multilingual colloquial settings, it is a habitual occurrence to compose expressions of text or speech containing tokens or phrases of different languages, a phenomenon popularly known as code-switching or code-mixing (CMX). We present…

Computation and Language · Computer Science 2022-11-01 Lekan Raheem, Maab Elrashid

Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond

We propose task-adaptive tokenization as a way to adapt the generation pipeline to the specifics of a downstream task and enhance long-form generation in mental health. Inspired by insights from cognitive science, our task-adaptive…

Computation and Language · Computer Science 2023-11-14 Siyang Liu, Naihao Deng, Sahand Sabour, Yilin Jia, Minlie Huang, Rada Mihalcea

Incorporating External Knowledge through Pre-training for Natural Language to Code Generation

Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing…

Computation and Language · Computer Science 2020-04-21 Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, Graham Neubig

TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills

Code pre-trained models (CodePTMs) have recently demonstrated a solid capacity to process various software intelligence tasks, e.g., code clone detection, code translation, and code summarization. The current mainstream method that deploys…

Software Engineering · Computer Science 2024-05-10 Qiushi Sun, Nuo Chen, Jianing Wang, Xiang Li, Ming Gao