Related papers: Evaluating Software Development Agents: Patch Patt…
AI-driven software development has rapidly advanced with the emergence of software development agents that leverage large language models (LLMs) to tackle complex, repository-level software engineering tasks. These agents go beyond just…
AI agents have rapidly gained prominence in both research and industry as systems that extend large language models with planning, tool use, memory, and goal-directed action. Despite this progress, the development and maintenance of Agent…
Unlike traditional automation tools or static LLM-based systems, agents combine decision-making and tool utilization to accomplish complex tasks, showing great potential in software engineering. However, existing studies largely focus on…
The rapid adoption of AI coding agents for software development has raised important questions about the quality and maintainability of the code they produce. While prior studies have examined AI-generated source code, the impact of AI…
AI coding agents are rapidly transforming software engineering by performing tasks such as feature development, debugging, and testing. Despite their growing impact, the research community lacks a comprehensive dataset capturing how these…
Large language models are redefining software engineering by implementing AI-powered techniques throughout the whole software development process, including requirement gathering, software architecture, code generation, testing, and…
The arrival of large language models (LLMs) capable of multi-step reasoning, tool use, and long-horizon planning has produced a qualitative shift in software engineering. Where earlier code-completion tools such as GitHub Copilot operated…
Fine-tuning large language models for code editing has typically relied on mining commits and pull requests. The working hypothesis has been that commit messages describe human intent in natural language, and patches to code describe the…
The rise of AI agents is transforming how software can be built. The promise of agents is that developers might write code quicker, delegate multiple tasks to different agents, and even write a full piece of software purely out of natural…
AI agents help developers with various coding tasks, such as developing new features, fixing bugs, and reviewing code. Developers can write a GitHub issue and assign it to an AI agent like Copilot for implementation. Based on the issue and…
LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are…
Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS)…
Static benchmarks measure what AI agents can do at a fixed point in time but not how they are adopted, maintained, or experienced in deployment. We introduce AgentPulse, a continuous evaluation framework scoring 50 agents across 10 workload…
The rise of large language models (LLMs) has sparked a surge of interest in agents, leading to the rapid growth of agent frameworks. Agent frameworks are software toolkits and libraries that provide standardized components, abstractions,…
In the first half of 2025, coding agents emerged as a category of development tools that very quickly transitioned into practice. Unlike "traditional" code completion LLMs such as Copilot, agents like Cursor, Claude Code, or…
Large language models (LLMs) and their agentic frameworks are increasingly adopted to perform development tasks such as automated program repair (APR). While prior work has identified security risks in LLM-generated code, most have focused…
Current benchmarks for evaluating software engineering agents, such as SWE-bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in integrated…
Software Engineering Agents (SWE agents) can autonomously perform development tasks on benchmarks like SWE-bench, but still face challenges when tackling complex and ambiguous real-world tasks. Consequently, SWE agents are often designed to…
Beyond scratch coding, exploiting large-scale code repositories (e.g., GitHub) for practical tasks is vital in real-world software development, yet current benchmarks rarely evaluate code agents in such authentic, workflow-driven scenarios.…
Large language model (LLM) based coding agents increasingly act as autonomous contributors that generate and merge pull requests, yet their real-world effects on software projects are unclear, especially compared with widely adopted…