Related papers: RECODE-H: A Benchmark for Research Code Developmen…

ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments

Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions,…

Software Engineering · Computer Science 2025-02-28 Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He

Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols

Effective feedback is essential for fostering students' success in scientific inquiry. With advancements in artificial intelligence, large language models (LLMs) offer new possibilities for delivering instant and adaptive feedback. However,…

Artificial Intelligence · Computer Science 2025-02-19 Kathrin Seßler, Arne Bewersdorff, Claudia Nerdel, Enkelejda Kasneci

FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks

Code repair is a fundamental task in software development, facilitating efficient bug resolution and software maintenance. Although large language models (LLMs) have demonstrated considerable potential in automated code repair, their…

Software Engineering · Computer Science 2026-02-27 Dekun Dai, MingWei Liu, Anji Li, Jialun Cao, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng

LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research

Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP…

Software Engineering · Computer Science 2025-06-24 Shuo Yan, Ruochen Li, Ziming Luo, Zimu Wang, Daoyang Li, Liqiang Jing, Kaiyu He, Peilin Wu, George Michalopoulos, Yue Zhang, Ziyang Zhang, Mian Zhang, Zhiyu Chen, Xinya Du

Leveraging Large Language Models for Automated Reproduction of Networking Research Results

Code reproduction is a cornerstone of scientific validity, yet it remains a formidable challenge in computer networking research due to the scarcity of open-source implementations and the complexity of heterogeneous system architectures.…

Networking and Internet Architecture · Computer Science 2026-02-17 Yining Jiang, Yunxin Xu, Wenyun Xu, Yufan Zhu, Tangtang He, Haiying Huang, Letian Zhu, Qingyu Song, Qiang Su, Lizhao You, Lu Tang, Wanjin Feng, Yuchao Zhang, Linghe Kong, Qiao Xiang, Jiwu Shu

CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models

Pre-trained on massive amounts of code and text data, large language models (LLMs) have demonstrated remarkable achievements in performing code generation tasks. With additional execution-based feedback, these models can act as agents with…

Computation and Language · Computer Science 2024-11-14 Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Doyen Sahoo

UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback

Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. In…

Computation and Language · Computer Science 2024-06-13 Jason Wu, Eldon Schoop, Alan Leung, Titus Barik, Jeffrey P. Bigham, Jeffrey Nichols

Can large language models provide useful feedback on research papers? A large-scale empirical analysis

Expert feedback lays the foundation of rigorous research. However, the rapid growth of scholarly production and intricate knowledge specialization challenge the conventional scientific feedback mechanisms. High-quality peer reviews are…

Machine Learning · Computer Science 2023-10-04 Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Smith, Yian Yin, Daniel McFarland, James Zou

CelloAI Benchmarks: Toward Repeatable Evaluation of AI Assistants

Large Language Models (LLMs) are increasingly used for software development, yet existing benchmarks for LLM-based coding assistance do not reflect the constraints of High Energy Physics (HEP) and High Performance Computing (HPC) software.…

High Energy Physics - Experiment · Physics 2026-03-03 Mohammad Atif, Kriti Chopra, Fang-Ying Tsai, Ozgur O. Kilic, Tianle Wang, Zhihua Dong, Douglas Benjamin, Charles Leggett, Meifeng Lin, Paolo Calafiura, Salman Habib

Re4: Scientific Computing Agent with Rewriting, Resolution, Review and Revision

Large language models (LLMs) serve as an active and promising field of generative artificial intelligence and have demonstrated abilities to perform complex tasks in multiple domains, including mathematical and scientific reasoning. In this…

Artificial Intelligence · Computer Science 2026-03-03 Ao Cheng, Lei Zhang, Guowei He

Human-Aligned Enhancement of Programming Answers with LLMs Guided by User Feedback

Large Language Models (LLMs) are widely used to support software developers in tasks such as code generation, optimization, and documentation. However, their ability to improve existing programming answers in a human-like manner remains…

Software Engineering · Computer Science 2026-01-27 Suborno Deb Bappon, Saikat Mondal, Chanchal K. Roy, Kevin Schneider

Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach

Automatic code generation has gained significant momentum with the advent of Large Language Models (LLMs) such as GPT-4. Although many studies focus on improving the effectiveness of LLMs for code generation, very limited work tries to…

Software Engineering · Computer Science 2025-06-02 Melika Sepidband, Hamed Taherkhani, Song Wang, Hadi Hemmati

ReCode: Updating Code API Knowledge with Reinforcement Learning

Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their…

Computation and Language · Computer Science 2025-11-25 Haoze Wu, Yunzhi Yao, Wenhao Yu, Ningyu Zhang

Human-AI Collaboration in Science at Scale: A Global Large-scale Randomized Field Experiment

Collaboration is the defining mode of modern science, yet its core mechanism -- feedback -- remains hard to observe, difficult to scale, and unequally distributed. Here we test whether large language models (LLMs) can contribute to this…

Physics and Society · Physics 2026-05-26 Binglu Wang, Weixin Liang, Jiahui Xue, Yuhui Zhang, Hancheng Cao, Dashun Wang, Yian Yin

Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'

Recently, a number of repository-level code generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…

Software Engineering · Computer Science 2025-06-26 Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan

CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges

Large Language Models (LLMs) have shown promise in automated code generation but typically excel only in simpler tasks such as generating standalone code units. Real-world software development, however, often involves complex code…

Software Engineering · Computer Science 2024-08-12 Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, Zhi Jin

Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications

Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate…

Software Engineering · Computer Science 2025-04-03 Nam Huynh, Beiyu Lin

QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback

Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it…

Computation and Language · Computer Science 2025-11-04 Taku Mikuriya, Tatsuya Ishigaki, Masayuki Kawarada, Shunya Minami, Tadashi Kadowaki, Yohichi Suzuki, Soshun Naito, Shunya Takata, Takumi Kato, Tamotsu Basseda, Reo Yamada, Hiroya Takamura

SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

This study evaluates large language models (LLMs) in generating code from algorithm descriptions in recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and academic…

Computation and Language · Computer Science 2025-08-08 Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He

RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to…

Computation and Language · Computer Science 2025-02-19 Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, Gabriel Synnaeve