Related papers: Code Needs Comments: Enhancing Code LLMs with Comm…

Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks

Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model…

Software Engineering · Computer Science 2025-04-29 Kang Yang , Xinjun Mao , Shangwen Wang , Yanlin Wang , Tanghaoran Zhang , Bo Lin , Yihao Qin , Zhang Zhang , Yao Lu , Kamal Al-Sabahi

Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation

Generating accurate code review comments remains a significant challenge due to the inherently diverse and non-unique nature of the task output. Large language models pretrained on both programming and natural language data tend to perform…

Software Engineering · Computer Science 2024-11-18 Md. Asif Haider , Ayesha Binte Mostofa , Sk. Sabit Bin Mosaddek , Anindya Iqbal , Toufique Ahmed

Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities

The increasing size and complexity of pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training sets could…

Computation and Language · Computer Science 2025-02-03 Yaping Chai , Haoran Xie , Joe S. Qin

To Code, or Not To Code? Exploring Impact of Code in Pre-training

Including code in the pre-training data mixture, even for models not specifically designed for code, has become a common practice in LLMs pre-training. While there has been anecdotal consensus among practitioners that code data plays a…

Computation and Language · Computer Science 2024-08-21 Viraat Aryabumi , Yixuan Su , Raymond Ma , Adrien Morisot , Ivan Zhang , Acyr Locatelli , Marzieh Fadaee , Ahmet Üstün , Sara Hooker

Revisiting the Role of Natural Language Code Comments in Code Translation

The advent of large language models (LLMs) has ushered in a new era in automated code translation across programming languages. Since most code-specific LLMs are pretrained on well-commented code from large repositories like GitHub, it is…

Software Engineering · Computer Science 2026-01-26 Monika Gupta , Ajay Meena , Anamitra Roy Choudhury , Vijay Arya , Srikanta Bedathur

Automating Patch Set Generation from Code Review Comments Using Large Language Models

The advent of Large Language Models (LLMs) has revolutionized various domains of artificial intelligence, including the realm of software engineering. In this research, we evaluate the efficacy of pre-trained LLMs in replicating the tasks…

Software Engineering · Computer Science 2024-06-10 Tajmilur Rahman , Rahul Singh , Mir Yousuf Sultan

Leveraging Design-Aware Context in Large Language Models for Code Comment Generation

Comments are very useful to the flow of code development. With the increasing commonality of code, novice coders have been creating a significant amount of codebases. Due to lack of commenting standards, their comments are often useless,…

Software Engineering · Computer Science 2026-05-05 Aritra Mitra , Srijoni Majumdar , Anamitra Mukhopadhyay , Partha Pratim Das , Paul D Clough , Partha Pratim Chakrabarti

Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning

Code comment generation aims at generating natural language descriptions for a code snippet to facilitate developers' program comprehension activities. Despite being studied for a long time, a bottleneck for existing approaches is that…

Software Engineering · Computer Science 2023-06-16 Mingyang Geng , Shangwen Wang , Dezun Dong , Haotian Wang , Ge Li , Zhi Jin , Xiaoguang Mao , Xiangke Liao

Exploring the Capabilities of LLMs for Code Change Related Tasks

Developers deal with code-change-related tasks daily, e.g., reviewing code. Pre-trained code and code-change-oriented models have been adapted to help developers with such tasks. Recently, large language models (LLMs) have shown their…

Software Engineering · Computer Science 2024-07-04 Lishui Fan , Jiakun Liu , Zhongxin Liu , David Lo , Xin Xia , Shanping Li

Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification

Code review is a crucial practice in software development. As code review nowadays is lightweight, various issues can be identified, and sometimes, they can be trivial. Research has investigated automated approaches to classify review…

Software Engineering · Computer Science 2025-08-14 Linh Nguyen , Chunhua Liu , Hong Yi Lin , Patanamon Thongtanunam

Too Noisy To Learn: Enhancing Data Quality for Code Review Comment Generation

Code review is an important practice in software development, yet it is time-consuming and requires substantial effort. While open-source datasets have been used to train neural models for automating code review tasks, including review…

Software Engineering · Computer Science 2025-02-07 Chunhua Liu , Hong Yi Lin , Patanamon Thongtanunam

An Empirical Study on Influence-Based Pretraining Data Selection for Code Large Language Models

Recent advancements in code large language models (Code-LLMs) have demonstrated remarkable capabilities in resolving programming related tasks. Meanwhile, researchers have recognized that the quality of pre-training data is crucial for…

Software Engineering · Computer Science 2026-04-10 Chengli Xing , Zhengran Zeng , Gexiang Fang , Rui Xie , Wei Ye , Shikun Zhang

Improving the Ability of Pre-trained Language Model by Imparting Large Language Model's Experience

Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks (e.g., code completion and code generation). By leveraging huge existing code corpora (e.g., GitHub),…

Software Engineering · Computer Science 2025-01-16 Xin Yin , Chao Ni , Xiaodan Xu , Xinrui Li , Xiaohu Yang

A study of the impact of generative AI-based data augmentation on software metadata classification

This paper presents the system submitted by the team from IIT(ISM) Dhanbad in FIRE IRSE 2023 shared task 1 on the automatic usefulness prediction of code-comment pairs as well as the impact of Large Language Model(LLM) generated data on…

Software Engineering · Computer Science 2023-10-24 Tripti Kumari , Chakali Sai Charan , Ayan Das

High-quality data augmentation for code comment classification

Code comments serve a crucial role in software development for documenting functionality, clarifying design choices, and assisting with issue tracking. They capture developers' insights about the surrounding source code, serving as an…

Software Engineering · Computer Science 2026-01-28 Thomas Borsani , Andrea Rosani , Giuseppe Di Fatta

Better Language Models of Code through Self-Improvement

Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is…

Computation and Language · Computer Science 2023-05-11 Hung Quoc To , Nghi D. Q. Bui , Jin Guo , Tien N. Nguyen

Enhancing Code Generation for Low-Resource Languages: No Silver Bullet

The advent of Large Language Models (LLMs) has significantly advanced the field of automated code generation. LLMs rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource…

Software Engineering · Computer Science 2025-02-03 Alessandro Giagnorio , Alberto Martin-Lopez , Gabriele Bavota

Enhancing Code Intelligence Tasks with ChatGPT

Pre-trained code models have emerged as crucial tools in various code intelligence tasks. However, their effectiveness depends on the quality of the pre-training dataset, particularly the human reference comments, which serve as a bridge…

Software Engineering · Computer Science 2023-12-27 Kang Yang , Xinjun Mao , Shangwen Wang , Tanghaoran Zhang , Bo Lin , Yanlin Wang , Yihao Qin , Zhang Zhang , Xiaoguang Mao

Harnessing Large Language Models for Curated Code Reviews

In code review, generating structured and relevant comments is crucial for identifying code issues and facilitating accurate code changes that ensure an efficient code review process. Well-crafted comments not only streamline the code…

Software Engineering · Computer Science 2025-02-06 Oussama Ben Sghaier , Martin Weyssow , Houari Sahraoui

Empowering Large Language Models for Textual Data Augmentation

With the capabilities of understanding and executing natural language instructions, Large language models (LLMs) can potentially act as a powerful tool for textual data augmentation. However, the quality of augmented data depends heavily on…

Computation and Language · Computer Science 2024-04-30 Yichuan Li , Kaize Ding , Jianling Wang , Kyumin Lee