Related papers: Commit0: Library Generation from Scratch

Learning to Commit: Generating Organic Pull Requests via Online Repository Memory

Large language model (LLM)-based coding agents achieve impressive results on controlled benchmarks yet routinely produce pull requests that real maintainers reject. The root cause is not functional incorrectness but a lack of organicity:…

Software Engineering · Computer Science 2026-03-30 Mo Li, L. H. Xu, Qitai Tan, Ting Cao, Yunxin Liu

Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios

Large Language Models (LLMs) are driving a shift towards intent-driven development, where agents build complete software from scratch. However, existing benchmarks fail to assess this 0-to-1 generation capability due to two limitations:…

Software Engineering · Computer Science 2026-04-09 Ruida Hu, Xinchen Wang, Chao Peng, Cuiyun Gao, David Lo

ProgramBench: Can Language Models Rebuild Programs From Scratch?

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings…

Software Engineering · Computer Science 2026-05-06 John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press

From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking

Recent progress in autonomous code generation has fueled excitement around AI agents capable of accelerating scientific discovery by running experiments. However, there is currently no benchmark that evaluates whether such agents can…

Artificial Intelligence · Computer Science 2025-06-25 Gyeongwon James Kim, Alex Wilf, Louis-Philippe Morency, Daniel Fried

Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Large Language Model (LLM) Agents, often trained with Reinforcement Learning (RL), are constrained by a dependency on human-curated data, limiting scalability and tethering AI to human knowledge. Existing self-evolution frameworks offer an…

Machine Learning · Computer Science 2025-11-21 Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, Huaxiu Yao

CommitBench: A Benchmark for Commit Message Generation

Writing commit messages is a tedious daily task for many software developers, and often remains neglected. Automating this task has the potential to save time while ensuring that messages are informative. A high-quality dataset and an…

Computation and Language · Computer Science 2024-03-11 Maximilian Schall, Tamara Czinczoll, Gerard de Melo

RepoZero: Can LLMs Generate a Code Repository from Scratch?

Large Language Models (LLMs) have recently shown remarkable progress in code generation, yet their ability to construct complete software repositories from scratch remains poorly understood. A fundamental bottleneck is the lack of…

Software Engineering · Computer Science 2026-05-21 Zhaoxi Zhang, Yiming Xu, Jiahui Liang, Weikang Li, Xiaoshuai Chen, Liwei Qian, Xin Pei, Jizhou Huang, Run Sun, Yunfang Wu

app.build: A Production Framework for Scaling Agentic Prompt-to-App Generation with Environment Scaffolding

We present app.build (https://github.com/neondatabase/appdotbuild-agent), an open-source framework that improves LLM-based application generation through systematic validation and structured environments. Our approach combines multi-layered…

Artificial Intelligence · Computer Science 2026-01-13 Evgenii Kniazev, Arseny Kravchenko, Igor Rekun, James Broadhead, Nikita Shamgunov, Pranav Sah, Pratik Nichite, Ivan Yamshchikov

Experimenting with Multi-Agent Software Development: Towards a Unified Platform

Large language models are redefining software engineering by implementing AI-powered techniques throughout the whole software development process, including requirement gathering, software architecture, code generation, testing, and…

Software Engineering · Computer Science 2024-06-11 Malik Abdul Sami, Muhammad Waseem, Zeeshan Rasheed, Mika Saari, Kari Systä, Pekka Abrahamsson

GOAT: A Training Framework for Goal-Oriented Agent with Tools

Current approaches rely on zero-shot evaluation due to the absence of training data; while proprietary models such as GPT-4 exhibit strong reasoning capabilities, smaller open-source models remain ineffective at complex tool use. To address…

Artificial Intelligence · Computer Science 2026-05-05 Hyunji Min, Sangwon Jung, Junyoung Sung, Dosung Lee, Leekyeung Han, Paul Hongsuck Seo

Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents

Agentic AI systems can now generate code with remarkable fluency, but a fundamental question remains: does the generated code actually do what the user intended? The gap between informal natural language requirements and precise…

Software Engineering · Computer Science 2026-03-19 Shuvendu K. Lahiri

Continuous Benchmark Generation for Evaluating Enterprise-scale LLM Agents

The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluation of AI agents typically involves using a fixed set of benchmarks and…

Software Engineering · Computer Science 2025-11-14 Divyanshu Saxena, Rishikesh Maurya, Xiaoxuan Ou, Gagan Somashekar, Shachee Mishra Gupta, Arun Iyer, Yu Kang, Chetan Bansal, Aditya Akella, Saravan Rajmohan

Echo: Learning from Experience Data via User-Driven Refinement

Static "human data" faces inherent limitations: it is expensive to scale and bounded by the knowledge of its creators. Continuous learning from "experience data" - interactions between agents and their environments - promises to transcend…

Artificial Intelligence · Computer Science 2026-05-22 Hande Dong, Xiaoyun Liang, Jiarui Yu, Jiayi Lin, Changqing Ai, Feng Liu, Wenjun Zhang, Rongbi Wei, Chaofan Zhu, Linjie Che, Feng Wu, Xin Shen, Dexu Kong, Xiaotian Wang, Qiuyuan Chen, Bingxu An, Yueting Lei, Qiang Lin

Smarter Together: Creating Agentic Communities of Practice through Shared Experiential Learning

The transition from human-centric to agent-centric software development practices is disrupting existing knowledge sharing environments for software developers. Traditional peer-to-peer repositories and developer communities for shared…

Artificial Intelligence · Computer Science 2025-11-12 Valentin Tablan, Scott Taylor, Gabriel Hurtado, Kristoffer Bernhem, Anders Uhrenholt, Gabriele Farei, Karo Moilanen

Agent-based code generation for the Gammapy framework

Software code generation using Large Language Models (LLMs) is one of the most successful applications of modern artificial intelligence. Foundational models are very effective for popular frameworks that benefit from documentation,…

Software Engineering · Computer Science 2025-10-01 Dmitriy Kostunin, Vladimir Sotnikov, Sergo Golovachev, Abhay Mehta, Tim Lukas Holch, Elisa Jones

GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities

The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks…

Software Engineering · Computer Science 2025-07-23 Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B. Muller, Irina Rish, Samira Ebrahimi Kahou, Massimo Caccia

An Agentic Evaluation Framework for AI-Generated Scientific Code in PETSc

While large language models have significantly accelerated scientific code generation, comprehensively evaluating the generated code remains a major challenge. Traditional benchmarks reduce evaluation to test-case matching, an approach…

Artificial Intelligence · Computer Science 2026-03-18 Hong Zhang, Barry Smith, Satish Balay, Le Chen, Murat Keceli, Lois Curfman McInnes, Junchao Zhang

FABRIC: Framework for Agent-Based Realistic Intelligence Creation

Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. Realizing these capabilities requires access to agentic data: structured interaction…

Artificial Intelligence · Computer Science 2025-10-22 Abhigya Verma, Seganrasan Subramanian, Nandhakumar Kandasamy, Naman Gupta

AI Coding Agents Need Better Compiler Remarks

Modern AI agents optimize programs by refactoring source code to trigger trusted compiler transformations. This preserves program semantics and reduces source code pollution, making the program easier to maintain and portable across…

Programming Languages · Computer Science 2026-04-16 Akash Deo, Simone Campanoni, Tommy McMichen

Agents' Room: Narrative Generation through Multi-step Collaboration

Writing compelling fiction is a multifaceted process combining elements such as crafting a plot, developing interesting characters, and using evocative language. While large language models (LLMs) show promise for story writing, they…

Computation and Language · Computer Science 2025-03-17 Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki, Alice Shoshana Jakobovits, Elizabeth Clark, Mirella Lapata