Computer Science

SpecBench: Evaluating Specification-Level Reasoning for Software Engineering LLM Agents

Software engineering (SWE) agents are transitioning from code generation to full software development lifecycle automation. A critical phase in this lifecycle is specification design: transforming initial proposals into carefully considered…

Multiagent Systems · Computer Science 2026-05-29 Grant Hamblin , Kevin Song , Zhanda Zhu , Anand Jayarajan , Sihang Liu , Nandita Vijaykumar , Gennady Pekhimenko

EASE Configuration Facilitates A Reproducible Science of LLM Social Simulations

LLMs are increasingly deployed to simulate social interactions, yet many of the existing simulators remain ad hoc and monolithic. This lack of architectural standardization prevents reproducible research and complicates downstream…

Multiagent Systems · Computer Science 2026-05-29 Sneheel Sarangi , Maximilian Puelma Touzel , Aurélien Bück-Kaeffer , Zachary Yang , Jean-François Godbout , Reihaneh Rabbany

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

While Multi-Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizing their dynamics remains a formidable challenge due to the discrete, non-differentiable nature of…

Multiagent Systems · Computer Science 2026-05-29 Wenwu Li , Yuran Song , Mingze Zhao , Bo Jin , Wenhao Li

Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency

AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that…

Software Engineering · Computer Science 2026-05-29 Chris Adams , Arjun Singh Banga , Parveen Bansal , Souvik Bhattacharya , Rujin Cao , Pedro Canahuati , Nate Cook , Brian Ellis , Prabhakar Goyal , Gurinder Grewal , Tianyu He , Matt Labunka , Alex Manners , David Molnar , Ging Cee Ng , Vishal Parekh , Jiefu Pei , Frederic Sagnes , James Saindon , Will Shackleton , Sid Sidhu , Gursharan Singh , Karthik Chengayan Sridhar , Matt Steiner , Pratibha Udmalpet , Sean Xia , Stacey Yan , Audris Mockus , Peter Rigby , Nachiappan Nagappan

EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution

Large Language Models (LLMs) have shown promise for automated vulnerability repair (AVR), but they still face several limitations, including the lack of intra-vulnerability experience accumulation and the lack of cross-vulnerability…

Software Engineering · Computer Science 2026-05-29 Haichuan Hu , Guoqing Xie , Quanjun Zhang , Jiawei Liu , Shengcheng Yu , Chunrong Fang , Zhenyu Chen , Liang Xiao

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more…

Multiagent Systems · Computer Science 2026-05-29 Corrado Rainone , Davide Belli , Bence Major , Arash Behboodi

Projectional Decoding: Towards Semantic-Aware LLM Generation

Large language models (LLMs) are increasingly used to generate software artifacts across many software engineering (SE) tasks, yet ensuring the semantic validity of these artifacts remains a fundamental challenge. Existing constrained…

Software Engineering · Computer Science 2026-05-29 Boqi Chen , José Antonio Hernández López , Aren A. Babikian

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the…

Software Engineering · Computer Science 2026-05-29 Parsa Mazaheri

Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent $\mathcal{R}$…

Multiagent Systems · Computer Science 2026-05-29 Víctor Gallego

Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents

Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and financial losses. While LLM-based approaches show promise in code analysis, they struggle with deep…

Software Engineering · Computer Science 2026-05-29 Xiang Liu , Sa Song , Zhaowei Zhang , Huiying Lan , Jason Zeng , Ming Wu , Michael Heinrich , Yong Sun , Ceyao Zhang

Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension

Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or do scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for…

Multiagent Systems · Computer Science 2026-05-29 Francisco León Zúñiga Bolívar

TagDebt: A Bot to Support Technical Debt Management

Context: Technical debt (TD) is a widely studied metaphor that helps to explain how sub-optimal decisions can harm software maintainability over time. Although incurring TD is not intrinsically bad, tracking and managing TD are crucial…

Software Engineering · Computer Science 2026-05-29 João Paulo Biazotto , Daniel Feitosa , Paris Avgeriou , Elisa Yumi Nakagawa

Inferring Code Correctness from Specification

Large language models (LLMs) have become integral to modern software development, enabling automated code generation at scale. However, validating the correctness of LLM-generated code remains a critical and largely unsolved challenge.…

Software Engineering · Computer Science 2026-05-29 Tambon Florian , Papadakis Mike

Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate…

Multiagent Systems · Computer Science 2026-05-29 Zhezheng Hao , Tianfu Wang , Huanshuo Dong , Ziyan Liu , Hong Wang , Xiankun Lin , Qiang Lin , Can Wang , Hande Dong , Jiawei Chen

CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems

Although large language model (LLM)-based multi-agent systems (MAS) can solve complex tasks and outperform single-agent systems, they incur large computational overheads because of heavy…

Multiagent Systems · Computer Science 2026-05-29 Ziyang Ma , Dingyi Zhang , Sichu Liang , Jiajia Chu , Pengfei Xia , Hui Zang , Deyu Zhou

GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing

Exploratory GUI testing is a particularly demanding setting for MLLM agents: without predefined test scripts, an agent must autonomously navigate an application and discover defects through its own interaction. However, current evaluation…

Software Engineering · Computer Science 2026-05-29 Xiaoyi Chen , Yifei Gao , Yang Xu , Xingxing Song , Yi Zhang , Jitao Sang

DynaGraph: Lightweight Multi-Model Interaction Framework via Dynamic Topological Reconfiguration

Tackling complex reasoning tasks typically relies on massive monolithic LLMs, which suffer from severe computational redundancy. While task decomposition through structured pipelines or multi-agent collaborations offers an alternative,…

Multiagent Systems · Computer Science 2026-05-29 Yanxing Guo , Zihao Zheng , Fangzhou Wu , Ling Liang , Lin Bao , Zongwei Wang , Yimao Cai

CODEFUSE-DEBENCH: An Empirical Study on Readability, Recompilability, and Functionality

Binary decompilation aims to recover binaries into high-level source code, but existing evaluations mainly rely on syntactic similarity or single-axis readability metrics, which fail to capture practical reusability. We propose a…

Software Engineering · Computer Science 2026-05-29 Puzhuo Liu , Yuhan Huang , Jianlei Chi , Peng Di , Yu Jiang

Usability Analysis of Configurator User Interfaces with Multimodal Large Language Models

Configuration is a key technology for tailoring complex software systems, services, and products. A successful application of configurators not only depends on technical correctness, performance, and domain modeling but also on their…

Software Engineering · Computer Science 2026-05-29 Sebastian Lubos , Alexander Felfernig , Damian Garber , Adnan Kraljić , Tarik Kraljić , Viet-Man Le , Thi Ngoc Trang Tran , Gerhard Leitner , Julian Schwazer , Doris Suppan , Reinhard Willfort , Ivan Dukic , Jeremias Fuchs , Manuel Henrich

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational study of…

Software Engineering · Computer Science 2026-05-29 Ningzhi Tang , Chaoran Chen , Gelei Xu , Yiyu Shi , Yu Huang , Collin McMillan , Tao Dong , Toby Jia-Jun Li