Related papers: DA-Code: Agent Data Science Code Generation Benchm…

A Survey on Code Generation with LLM-based Agents

Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm. Distinct from previous code generation techniques, code generation agents are characterized by three core features. 1)…

Software Engineering · Computer Science 2025-10-01 Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin, Ge Li

DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?

Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI…

Artificial Intelligence · Computer Science 2025-04-14 Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu

DSCodeBench: A Realistic Benchmark for Data Science Code Generation

We introduce DSCodeBench, a new benchmark designed to evaluate large language models (LLMs) on complicated and realistic data science code generation tasks. DSCodeBench consists of 1,000 carefully constructed problems sourced from realistic…

Software Engineering · Computer Science 2025-11-18 Shuyin Ouyang, Dong Huang, Jingwen Guo, Zeyu Sun, Qihao Zhu, Jie M. Zhang

IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis

Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts' decisions evolve with deeper insights of the dataset. To address this, we introduce…

Computation and Language · Computer Science 2025-06-09 Hanyu Li, Haoyu Liu, Tingyu Zhu, Tianyu Guo, Zeyu Zheng, Xiaotie Deng, Michael I. Jordan

CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges

Large Language Models (LLMs) have shown promise in automated code generation but typically excel only in simpler tasks such as generating standalone code units. Real-world software development, however, often involves complex code…

Software Engineering · Computer Science 2024-08-12 Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, Zhi Jin

Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks

Large Language Models (LLMs) have shown remarkable capabilities in code generation tasks, yet they face significant limitations in handling complex, long-context programming challenges and demonstrating complex compositional reasoning…

Artificial Intelligence · Computer Science 2025-01-14 Amr Almorsi, Mohanned Ahmed, Walid Gomaa

CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models

Pre-trained on massive amounts of code and text data, large language models (LLMs) have demonstrated remarkable achievements in performing code generation tasks. With additional execution-based feedback, these models can act as agents with…

Computation and Language · Computer Science 2024-11-14 Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Doyen Sahoo

DataSciBench: An LLM Agent Benchmark for Data Science

This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and…

Computation and Language · Computer Science 2025-02-20 Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue

FormulaCode: Evaluating Agentic Optimization on Large Codebases

Large language model (LLM) coding agents increasingly operate at the repository level, motivating benchmarks that evaluate their ability to optimize entire codebases under realistic constraints. Existing code benchmarks largely rely on…

Software Engineering · Computer Science 2026-05-18 Atharva Sehgal, James Hou, Akanksha Sarkar, Ishaan Mantripragada, Swarat Chaudhuri, Jennifer J. Sun, Yisong Yue

DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning

In this work, we investigate the potential of large language models (LLMs) based agents to automate data science tasks, with the goal of comprehending task requirements, then building and training the best-fit machine learning models.…

Machine Learning · Computer Science 2024-05-29 Siyuan Guo, Cheng Deng, Ying Wen, Hechang Chen, Yi Chang, Jun Wang

Large Language Model-based Data Science Agent: A Survey

The rapid advancement of Large Language Models (LLMs) has driven novel applications across diverse domains, with LLM-based agents emerging as a crucial area of exploration. This survey presents a comprehensive analysis of LLM-based agents…

Artificial Intelligence · Computer Science 2025-11-25 Ke Chen, Peiran Wang, Yaoning Yu, Xianyang Zhan, Haohan Wang

Generating Unseen Code Tests In Infinitum

Large Language Models (LLMs) are used for many tasks, including those related to coding. An important aspect of being able to utilize LLMs is the ability to assess their fitness for specific usages. The common practice is to evaluate LLMs…

Artificial Intelligence · Computer Science 2024-07-30 Marcel Zalmanovici, Orna Raz, Eitan Farchi, Iftach Freund

AI-powered Code Review with LLMs: Early Results

In this paper, we present a novel approach to improving software quality and efficiency through a Large Language Model (LLM)-based model designed to review code and identify potential issues. Our proposed LLM-based AI agent model is trained…

Software Engineering · Computer Science 2025-12-11 Zeeshan Rasheed, Malik Abdul Sami, Muhammad Waseem, Kai-Kristian Kemell, Xiaofeng Wang, Anh Nguyen, Kari Systä, Pekka Abrahamsson

SciNav: A General Agent Framework for Scientific Coding Tasks

Autonomous science agents built on large language models (LLMs) are increasingly used to generate hypotheses, design experiments, and produce reports. However, prior work mainly targets open-ended scientific problems with subjective outputs…

Computation and Language · Computer Science 2026-03-24 Tianshu Zhang, Huan Sun

DUALGUAGE: Automated Joint Security-Functionality Benchmarking for Secure Code Generation

Large language models (LLMs) and autonomous coding agents are increasingly used to generate software across a wide range of domains. Yet a core requirement remains unmet: ensuring that generated code is secure without compromising its…

Software Engineering · Computer Science 2025-11-27 Abhijeet Pathak, Suvadra Barua, Dinesh Gudimetla, Rupam Patir, Jiawei Guo, Hongxin Hu, Haipeng Cai

A Survey on Evaluating Large Language Models in Code Generation Tasks

This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development,…

Software Engineering · Computer Science 2025-03-05 Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, Wei Ye, Shikun Zhang

Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency

The use of large language models (LLMs) for automated code generation has emerged as a significant focus within AI research. As these pretrained models continue to evolve, their ability to understand and generate complex code structures has…

Software Engineering · Computer Science 2025-05-06 Nazmus Ashrafi, Salah Bouktif, Mohammed Mediani

From LLMs to LLM-based Agents for Software Engineering: A Survey of Current, Challenges and Future

With the rise of large language models (LLMs), researchers are increasingly exploring their applications in various vertical domains, such as software engineering. LLMs have achieved remarkable success in areas including code generation…

Software Engineering · Computer Science 2025-04-15 Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, Huaming Chen

Can AI Agents Answer Your Data Questions? A Benchmark for Data Agents

Users across enterprises increasingly rely on AI agents to query their data through natural language. However, building reliable data agents remains difficult because real-world data is often fragmented across multiple heterogeneous…

Databases · Computer Science 2026-03-24 Ruiying Ma, Shreya Shankar, Ruiqi Chen, Yiming Lin, Sepanta Zeighami, Rajoshi Ghosh, Abhinav Gupta, Anushrut Gupta, Tanmai Gopal, Aditya G. Parameswaran

ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges

Autonomous coding agents built on large language models (LLMs) can now solve many general software and machine learning tasks, but they remain ineffective on complex, domain-specific scientific problems. Medical imaging is a particularly…

Computer Vision and Pattern Recognition · Computer Science 2025-12-22 Roshan Kenia, Xiaoman Zhang, Pranav Rajpurkar